public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed
* JFYI: ext4 bug triggerable by kvm
@ 2010-08-16 14:00 Michael Tokarev
  2010-08-16 14:43 ` Anthony Liguori
  0 siblings, 1 reply; 23+ messages in thread
From: Michael Tokarev @ 2010-08-16 14:00 UTC (permalink / raw)
  To: KVM list; +Cc: Kevin Wolf

https://bugzilla.kernel.org/show_bug.cgi?id=16165

When a (raw) guest image is placed on an ext4 filesystem,
it is possible to get data corruption - this time due to an
ext4 bug, not a kvm bug.

Also, ext4 is _very_ slow on O_SYNC writes (which is what
kvm uses with the default cache mode).

JFYI.

/mjt

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-16 14:00 JFYI: ext4 bug triggerable by kvm Michael Tokarev
@ 2010-08-16 14:43 ` Anthony Liguori
  2010-08-16 18:42   ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Anthony Liguori @ 2010-08-16 14:43 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: KVM list, Kevin Wolf

On 08/16/2010 09:00 AM, Michael Tokarev wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=16165
>
> When a (raw) guest image is placed on an ext4 filesystem,
> it is possible to get data corruption, now due to ext4
> bug, not kvm bug.
>    

Yeah, there appear to be a few O_DIRECT-related issues with ext4.  
AFAIK, a preallocated raw image should be safe though, which is probably 
the only case where you should use O_DIRECT anyway.

> Also, ext4 is _very_ slow on O_SYNC writes (which is
> used in kvm with default cache).
>    

Yeah, we probably need to switch to sync_file_range() to avoid the 
journal commit on every write.
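For illustration, a minimal sketch of the call being proposed here, assuming a Linux host - the helper name is made up and this is not QEMU's actual write path:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Write a buffer, then force pagecache writeback of just that range
 * with sync_file_range().  This waits until the pages have been
 * written out, but (unlike O_SYNC or fsync) it does not commit
 * filesystem metadata or flush the disk's write cache. */
int write_and_push(int fd, const void *buf, size_t len, off_t off)
{
    ssize_t n = pwrite(fd, buf, len, off);
    if (n < 0 || (size_t)n != len)
        return -1;
    return sync_file_range(fd, off, len,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```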

Regards,

Anthony Liguori

> JFYI.
>
> /mjt


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-16 14:43 ` Anthony Liguori
@ 2010-08-16 18:42   ` Christoph Hellwig
  2010-08-16 20:34     ` Anthony Liguori
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2010-08-16 18:42 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Michael Tokarev, KVM list, Kevin Wolf

On Mon, Aug 16, 2010 at 09:43:09AM -0500, Anthony Liguori wrote:
> >Also, ext4 is _very_ slow on O_SYNC writes (which is
> >used in kvm with default cache).
> 
> Yeah, we probably need to switch to sync_file_range() to avoid the
> journal commit on every write.
> 

No, we don't.  sync_file_range does not actually provide any data
integrity.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-16 18:42   ` Christoph Hellwig
@ 2010-08-16 20:34     ` Anthony Liguori
  2010-08-17  9:07       ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Anthony Liguori @ 2010-08-16 20:34 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Michael Tokarev, KVM list, Kevin Wolf

On 08/16/2010 01:42 PM, Christoph Hellwig wrote:
> On Mon, Aug 16, 2010 at 09:43:09AM -0500, Anthony Liguori wrote:
>    
>>> Also, ext4 is _very_ slow on O_SYNC writes (which is
>>> used in kvm with default cache).
>>>        
>> Yeah, we probably need to switch to sync_file_range() to avoid the
>> journal commit on every write.
>>
>>      
> No, we don't.  sync_file_range does not actually provide any data
> integrity.
>    

What do you mean by data integrity?

For each write in cache=writethrough, we don't have to ensure the data 
is on the platter.  We really just need to ensure that the data has 
been sent to the next level in the storage hierarchy and that it has been 
acknowledged as having been written.  We don't need to actually inject a 
barrier.

My understanding is that on ext4/btrfs, an O_SYNC write injects a 
barrier for every write, which is not the behavior we're looking for.  As 
I understand it, sync_file_range() would give us the above guarantee 
without the barrier, and for explicit barriers we would use fsync().

Regards,

Anthony Liguori



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-16 20:34     ` Anthony Liguori
@ 2010-08-17  9:07       ` Christoph Hellwig
  2010-08-17  9:23         ` Avi Kivity
  2010-08-17 12:56         ` Anthony Liguori
  0 siblings, 2 replies; 23+ messages in thread
From: Christoph Hellwig @ 2010-08-17  9:07 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Christoph Hellwig, Michael Tokarev, KVM list, Kevin Wolf

On Mon, Aug 16, 2010 at 03:34:12PM -0500, Anthony Liguori wrote:
> On 08/16/2010 01:42 PM, Christoph Hellwig wrote:
> >On Mon, Aug 16, 2010 at 09:43:09AM -0500, Anthony Liguori wrote:
> >>>Also, ext4 is _very_ slow on O_SYNC writes (which is
> >>>used in kvm with default cache).
> >>Yeah, we probably need to switch to sync_file_range() to avoid the
> >>journal commit on every write.
> >>
> >No, we don't.  sync_file_range does not actually provide any data
> >integrity.
> 
> What do you mean by data integrity?

sync_file_range only does pagecache-level writeout of the file data.
It never calls into the actual filesystem, which means any block
allocations (for filling holes / converting preallocated space in normal
filesystems, or every write in COW-based image formats like qcow2) never
get flushed to disk, and even more importantly the disk write cache is
never flushed.

In short, it's completely worthless for any real filesystem.
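As a hedged sketch of the contrast (illustrative helper name, Linux host assumed): a durable write has to go through fdatasync(), which does call into the filesystem's ->fsync method:

```c
#include <fcntl.h>
#include <unistd.h>

/* Durable counterpart to a sync_file_range()-based write:
 * fdatasync() calls into the filesystem, committing any block
 * allocations the write needed and flushing the disk write cache. */
int write_durable(int fd, const void *buf, size_t len, off_t off)
{
    ssize_t n = pwrite(fd, buf, len, off);
    if (n < 0 || (size_t)n != len)
        return -1;
    return fdatasync(fd);
}
```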


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-17  9:07       ` Christoph Hellwig
@ 2010-08-17  9:23         ` Avi Kivity
  2010-08-17 11:17           ` Christoph Hellwig
  2010-08-17 12:56         ` Anthony Liguori
  1 sibling, 1 reply; 23+ messages in thread
From: Avi Kivity @ 2010-08-17  9:23 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Anthony Liguori, Michael Tokarev, KVM list, Kevin Wolf

  On 08/17/2010 12:07 PM, Christoph Hellwig wrote:
>
> In short it's completely worthless for any real filesystem.
>

The documentation should be updated then.  It suggests that it is usable 
for data integrity.

(or maybe, it should be fixed?)

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-17  9:23         ` Avi Kivity
@ 2010-08-17 11:17           ` Christoph Hellwig
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2010-08-17 11:17 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Christoph Hellwig, Anthony Liguori, Michael Tokarev, KVM list,
	Kevin Wolf

On Tue, Aug 17, 2010 at 12:23:01PM +0300, Avi Kivity wrote:
>  On 08/17/2010 12:07 PM, Christoph Hellwig wrote:
> >
> >In short it's completely worthless for any real filesystem.
> >
> 
> The documentation should be updated then.  It suggests that it is
> usable for data integrity.

The manpage has a "warning" section documenting what I said above since
I added it in January.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-17  9:07       ` Christoph Hellwig
  2010-08-17  9:23         ` Avi Kivity
@ 2010-08-17 12:56         ` Anthony Liguori
  2010-08-17 13:07           ` Christoph Hellwig
  1 sibling, 1 reply; 23+ messages in thread
From: Anthony Liguori @ 2010-08-17 12:56 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Michael Tokarev, KVM list, Kevin Wolf

On 08/17/2010 04:07 AM, Christoph Hellwig wrote:
> On Mon, Aug 16, 2010 at 03:34:12PM -0500, Anthony Liguori wrote:
>    
>> On 08/16/2010 01:42 PM, Christoph Hellwig wrote:
>>      
>>> On Mon, Aug 16, 2010 at 09:43:09AM -0500, Anthony Liguori wrote:
>>>        
>>>>> Also, ext4 is _very_ slow on O_SYNC writes (which is
>>>>> used in kvm with default cache).
>>>>>            
>>>> Yeah, we probably need to switch to sync_file_range() to avoid the
>>>> journal commit on every write.
>>>>
>>>>          
>>> No, we don't.  sync_file_range does not actually provide any data
>>> integrity.
>>>        
>> What do you mean by data integrity?
>>      
> sync_file_range only does pagecache-level writeout of the file data.
> It never calls into the actual filesystem, which means any block
> allocations (for filling holes / converting preallocated space in normal
> filesystems, or every write in COW-based image formats like qcow2) never
> get flushed to disk,

But assuming that you had a preallocated disk image, it would 
effectively flush the page cache, so it sounds like the only real issue 
is sparse and growable files.

>   and even more importantly the disk write cache is
> never flushed.
>    

The point is that we don't want to flush the disk write cache.  The 
intention of writethrough is not to make the disk cache writethrough but 
to treat the host's cache as writethrough.

Regards,

Anthony Liguori

> In short it's completely worthless for any real filesystem.
>
>    


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-17 12:56         ` Anthony Liguori
@ 2010-08-17 13:07           ` Christoph Hellwig
  2010-08-17 14:20             ` Anthony Liguori
  0 siblings, 1 reply; 23+ messages in thread
From: Christoph Hellwig @ 2010-08-17 13:07 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Christoph Hellwig, Michael Tokarev, KVM list, Kevin Wolf

On Tue, Aug 17, 2010 at 07:56:04AM -0500, Anthony Liguori wrote:
> But assuming that you had a preallocated disk image, it would
> effectively flush the page cache so it sounds like the only real
> issue is sparse and growable files.

For preallocated as in using fallocate() we are still converting unwritten
extents to regular ones and do have metadata updates.  For preallocated as
in writing zeroes into the whole image earlier we do indeed only
care about the data, and will not have metadata updates for most filesystems.
That still leaves COW-based filesystems that need to allocate new blocks
on every write, and from my reading NFS also needs the ->fsync callout
to actually commit unstable data to disk.

> >  and even more importantly the disk write cache is
> >never flushed.
> 
> The point is that we don't want to flush the disk write cache.  The
> intention of writethrough is not to make the disk cache writethrough
> but to treat the host's cache as writethrough.

We need to make sure data is not in the disk write cache if we want to
provide data integrity.  It has nothing to do with the qemu caching
mode - for cache=writeback or none it's committed as part of the fdatasync
call, and for cache=writethrough it's committed as part of the O_SYNC
write.  Note that both these paths end up calling the filesystem's ->fsync
method, which is what's required to make writes stable.  That's exactly
what is missing in sync_file_range, and that's why that API is not
useful at all for data integrity operations.  It's also what makes
fsync slow on extN - but the fix for that is not to drop data
integrity but to make fsync fast.  Various other filesystems can
already do it, and if you insist on using those that are slow for this
operation you'll have to suffer until that issue is fixed for them.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-17 13:07           ` Christoph Hellwig
@ 2010-08-17 14:20             ` Anthony Liguori
  2010-08-17 14:28               ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Anthony Liguori @ 2010-08-17 14:20 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Michael Tokarev, KVM list, Kevin Wolf

On 08/17/2010 08:07 AM, Christoph Hellwig wrote:
>> The point is that we don't want to flush the disk write cache.  The
>> intention of writethrough is not to make the disk cache writethrough
>> but to treat the host's cache as writethrough.
>>      
>
> We need to make sure data is not in the disk write cache if we want to
> provide data integrity.

When the guest explicitly flushes the emulated disk's write cache.  Not 
on every single write completion.

>    It has nothing to do with the qemu caching
> mode - for cache=writeback or none it's committed as part of the fdatasync
> call, and for cache=writethrough it's committed as part of the O_SYNC
> write.  Note that both these paths end up calling the filesystem's ->fsync
> method, which is what's required to make writes stable.  That's exactly
> what is missing in sync_file_range, and that's why that API is not
> useful at all for data integrity operations.

For normal writes from a guest, we don't need to follow the write with 
an fsync().  We should only need to issue an fsync() given an explicit 
flush from the guest.

>    It's also what makes
> fsync slow on extN - but the fix to that is not to not provide data
> integrity but rather to make fsync fast.  There's various other
> filesystems that can already do it, and if you insist on using those
> that are slow for this operation you'll have to suffer until that
> issue is fixed for them.
>    

fsync() being slow is orthogonal to my point.  I don't see why we need 
to do an fsync() on *every* write.  It should only be necessary when a 
guest injects an actual barrier.

Regards,

Anthony Liguori



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-17 14:20             ` Anthony Liguori
@ 2010-08-17 14:28               ` Christoph Hellwig
  2010-08-17 14:39                 ` Anthony Liguori
  2010-08-17 14:40                 ` Michael Tokarev
  0 siblings, 2 replies; 23+ messages in thread
From: Christoph Hellwig @ 2010-08-17 14:28 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Christoph Hellwig, Michael Tokarev, KVM list, Kevin Wolf

On Tue, Aug 17, 2010 at 09:20:37AM -0500, Anthony Liguori wrote:
> On 08/17/2010 08:07 AM, Christoph Hellwig wrote:
> >>The point is that we don't want to flush the disk write cache.  The
> >>intention of writethrough is not to make the disk cache writethrough
> >>but to treat the host's cache as writethrough.
> >
> >We need to make sure data is not in the disk write cache if we want to
> >provide data integrity.
> 
> When the guest explicitly flushes the emulated disk's write cache.
> Not on every single write completion.

That depends on the cache= mode.  For cache=none and cache=writeback
we present a write-back cache to the guest, and the guest does explicit
cache flushes.  For cache=writethrough we present a writethrough cache
to the guest, and we need to make sure data actually has hit the disk
before returning I/O completion to the guest.

> >   It has nothing to do with the qemu caching
> >mode - for cache=writeback or none it's committed as part of the fdatasync
> >call, and for cache=writethrough it's committed as part of the O_SYNC
> >write.  Note that both these paths end up calling the filesystem's ->fsync
> >method, which is what's required to make writes stable.  That's exactly
> >what is missing in sync_file_range, and that's why that API is not
> >useful at all for data integrity operations.
> 
> For normal writes from a guest, we don't need to follow the write
> with an fsync().  We should only need to issue an fsync() given an
> explicit flush from the guest.

Define normal writes.  For cache=none and cache=writeback we don't
have to, and instead do explicit fsync()/fdatasync() calls when we
get a cache flush from the guest.  For cache=writethrough we
guarantee data has made it to disk, and we implement this using
O_DSYNC/O_SYNC when opening the file.  That tells the operating system
to not return until data has hit the disk.  On Linux this is
internally implemented using a range fsync/fdatasync after the actual
write.
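A sketch of the mapping described above - the enum and helper are illustrative, not QEMU's actual option handling:

```c
#define _GNU_SOURCE
#include <fcntl.h>

enum cache_mode { CACHE_WRITETHROUGH, CACHE_WRITEBACK, CACHE_NONE };

/* Map a cache= mode to open(2) flags.  For writethrough, O_DSYNC makes
 * every write commit through the filesystem's fsync path before
 * returning; for none, O_DIRECT bypasses the host page cache; writeback
 * uses the page cache and relies on an explicit fdatasync() when the
 * guest issues a cache flush. */
int image_open(const char *path, enum cache_mode mode)
{
    int flags = O_RDWR;
    if (mode == CACHE_WRITETHROUGH)
        flags |= O_DSYNC;
    else if (mode == CACHE_NONE)
        flags |= O_DIRECT;
    return open(path, flags);
}
```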

> fsync() being slow is orthogonal to my point.  I don't see why we
> need to do an fsync() on *every* write.  It should only be necessary
> when a guest injects an actual barrier.

See above.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-17 14:28               ` Christoph Hellwig
@ 2010-08-17 14:39                 ` Anthony Liguori
  2010-08-17 14:45                   ` Christoph Hellwig
  2010-08-17 14:40                 ` Michael Tokarev
  1 sibling, 1 reply; 23+ messages in thread
From: Anthony Liguori @ 2010-08-17 14:39 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Michael Tokarev, KVM list, Kevin Wolf

On 08/17/2010 09:28 AM, Christoph Hellwig wrote:
> On Tue, Aug 17, 2010 at 09:20:37AM -0500, Anthony Liguori wrote:
>    
>> On 08/17/2010 08:07 AM, Christoph Hellwig wrote:
>>      
>>>> The point is that we don't want to flush the disk write cache.  The
>>>> intention of writethrough is not to make the disk cache writethrough
>>>> but to treat the host's cache as writethrough.
>>>>          
>>> We need to make sure data is not in the disk write cache if we want to
>>> provide data integrity.
>>>        
>> When the guest explicitly flushes the emulated disk's write cache.
>> Not on every single write completion.
>>      
> That depends on the cache= mode.  For cache=none and cache=writeback
> we present a write-back cache to the guest, and the guest does explicit
> cache flushes.  For cache=writethrough we present a writethrough cache
> to the guest, and we need to make sure data actually has hit the disk
> before returning I/O completion to the guest.
>    

Why?

The type of cache we present to the guest should only relate to how the 
hypervisor caches the storage.  It should be independent of how data is 
cached by the disk.

There can be many levels of caching in a storage hierarchy, and each 
level caches independently of the next.

If the user has a disk with a writeback cache and we expose a 
writethrough cache to the guest, it's not our responsibility to make 
sure that we break through the writeback cache on the disk.

Regards,

Anthony Liguori


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-17 14:28               ` Christoph Hellwig
  2010-08-17 14:39                 ` Anthony Liguori
@ 2010-08-17 14:40                 ` Michael Tokarev
  2010-08-17 14:44                   ` Anthony Liguori
  1 sibling, 1 reply; 23+ messages in thread
From: Michael Tokarev @ 2010-08-17 14:40 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Anthony Liguori, KVM list, Kevin Wolf

17.08.2010 18:28, Christoph Hellwig wrote:
> On Tue, Aug 17, 2010 at 09:20:37AM -0500, Anthony Liguori wrote:
[]
>> For normal writes from a guest, we don't need to follow the write
>> with an fsync().  We should only need to issue an fsync() given an
>> explicit flush from the guest.
> 
> Define normal writes.  For cache=none and cache=writeback we don't
> have to, and instead do explicit fsync()/fdatasync() calls when we
> get a cache flush from the guest.  For cache=writethrough we
> guarantee data has made it to disk, and we implement this using
> O_DSYNC/O_SYNC when opening the file.  That tells the operating system
> to not return until data has hit the disk.  On Linux this is
> internally implemented using a range fsync/fdatasync after the actual
> write.

And this is actually what I mentioned in the very beginning,
in the hopefully-single-thread email I sent: ext4 is very
slow when used with O_SYNC (without O_DIRECT).

I still have had no opportunity to collect more info on this, and
yes, I've seen your (Christoph's) speed tests of a few filesystems
in the famous "BTRFS: Unbelievably slow with kvm/qemu" thread.
A few users reported insanely slow write speeds for qcow2 files
with the default cache mode on ext4.

And this is what prompted all this discussion (which actually
has nothing to do with the $subject line ;): an attempt
to think about replacing O_SYNC/fsync() with something
"lighter"...

>> fsync() being slow is orthogonal to my point.  I don't see why we
>> need to do an fsync() on *every* write.  It should only be necessary
>> when a guest injects an actual barrier.

We don't do a sync on every write, but O_SYNC implies that.
And apparently that is what is happening behind the scenes
in the ext4 O_SYNC case.

But ok....

/mjt

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-17 14:40                 ` Michael Tokarev
@ 2010-08-17 14:44                   ` Anthony Liguori
  2010-08-17 14:46                     ` Christoph Hellwig
  0 siblings, 1 reply; 23+ messages in thread
From: Anthony Liguori @ 2010-08-17 14:44 UTC (permalink / raw)
  To: Michael Tokarev; +Cc: Christoph Hellwig, KVM list, Kevin Wolf

On 08/17/2010 09:40 AM, Michael Tokarev wrote:
>
>>> fsync() being slow is orthogonal to my point.  I don't see why we
>>> need to do an fsync() on *every* write.  It should only be necessary
>>> when a guest injects an actual barrier.
>>>        
> We don't do a sync on every write, but O_SYNC implies that.
> And apparently that is what is happening behind the scenes
> in the ext4 O_SYNC case.
>    

I think the real issue is that we're mixing host configuration with 
guest-visible state.

With O_SYNC, we're causing cache=writethrough to do writethrough through 
two layers of the storage hierarchy.  I don't think that's necessary or 
desirable, though.

Regards,

Anthony Liguori

> But ok....
>
> /mjt
>    


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-17 14:39                 ` Anthony Liguori
@ 2010-08-17 14:45                   ` Christoph Hellwig
  2010-08-17 14:53                     ` Avi Kivity
  2010-08-17 14:54                     ` Anthony Liguori
  0 siblings, 2 replies; 23+ messages in thread
From: Christoph Hellwig @ 2010-08-17 14:45 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Christoph Hellwig, Michael Tokarev, KVM list, Kevin Wolf

On Tue, Aug 17, 2010 at 09:39:15AM -0500, Anthony Liguori wrote:
> The type of cache we present to the guest should only relate to how
> the hypervisor caches the storage.  It should be independent of how
> data is cached by the disk.

It is.

> There can be many levels of caching in a storage hierarchy and each
> hierarchy cached independently of the next level.
> 
> If the user has a disk with a writeback cache, if we expose a
> writethrough cache to the guest, it's not our responsibility to make
> sure that we break through the writeback cache on the disk.

The user doesn't know or have to care about the caching.  The
user uses O_SYNC/fsync to say that it wants data on disk, and it's the
operating system's job to make that happen.  The situation with qemu
is the same - if we tell the guest that we do not have a volatile write
cache that needs explicit management, the guest can rely on the fact
that it does not have to do manual cache management.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-17 14:44                   ` Anthony Liguori
@ 2010-08-17 14:46                     ` Christoph Hellwig
  2010-08-17 14:57                       ` Anthony Liguori
  2010-08-17 14:59                       ` Avi Kivity
  0 siblings, 2 replies; 23+ messages in thread
From: Christoph Hellwig @ 2010-08-17 14:46 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Michael Tokarev, Christoph Hellwig, KVM list, Kevin Wolf

On Tue, Aug 17, 2010 at 09:44:49AM -0500, Anthony Liguori wrote:
> I think the real issue is we're mixing host configuration with guest
> visible state.

The last time I proposed to decouple the two, you and Avi were heavily
opposed to it.

> With O_SYNC, we're causing cache=writethrough to do writethrough
> through two layers of the storage hierarchy.  I don't think that's
> necessary or desirable though.

It's absolutely necessary if we tell the guest that we do not have
a volatile write cache.  Which is the only good reason to use
cache=writethrough anyway - except for dealing with old guests that
can't handle a volatile write cache, it's an absolutely stupid mode of
operation.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-17 14:45                   ` Christoph Hellwig
@ 2010-08-17 14:53                     ` Avi Kivity
  2010-08-17 14:54                     ` Anthony Liguori
  1 sibling, 0 replies; 23+ messages in thread
From: Avi Kivity @ 2010-08-17 14:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Anthony Liguori, Michael Tokarev, KVM list, Kevin Wolf

  On 08/17/2010 05:45 PM, Christoph Hellwig wrote:
>
> The user doesn't know or have to care about the caching.  The
> user uses O_SYNC/fsync to say that it wants data on disk, and it's the
> operating system's job to make that happen.  The situation with qemu
> is the same - if we tell the guest that we do not have a volatile write
> cache that needs explicit management, the guest can rely on the fact
> that it does not have to do manual cache management.
>

In the general case this is correct; however, sometimes we want to 
explicitly lie (cache=unsafe, or saying that we have a write-back cache 
when we don't, to preserve the guest's view of things after a migration).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-17 14:45                   ` Christoph Hellwig
  2010-08-17 14:53                     ` Avi Kivity
@ 2010-08-17 14:54                     ` Anthony Liguori
  2010-08-17 15:01                       ` Avi Kivity
  2010-08-17 15:02                       ` Christoph Hellwig
  1 sibling, 2 replies; 23+ messages in thread
From: Anthony Liguori @ 2010-08-17 14:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Michael Tokarev, KVM list, Kevin Wolf

On 08/17/2010 09:45 AM, Christoph Hellwig wrote:
> On Tue, Aug 17, 2010 at 09:39:15AM -0500, Anthony Liguori wrote:
>    
>> The type of cache we present to the guest should only relate to how
>> the hypervisor caches the storage.  It should be independent of how
>> data is cached by the disk.
>>      
> It is.
>
>    
>> There can be many levels of caching in a storage hierarchy and each
>> hierarchy cached independently of the next level.
>>
>> If the user has a disk with a writeback cache, if we expose a
>> writethrough cache to the guest, it's not our responsibility to make
>> sure that we break through the writeback cache on the disk.
>>      
> The user doesn't know or have to care about the caching.  The
> user uses O_SYNC/fsync to say that it wants data on disk, and it's the
> operating system's job to make that happen.  The situation with qemu
> is the same - if we tell the guest that we do not have a volatile write
> cache that needs explicit management, the guest can rely on the fact
> that it does not have to do manual cache management.
>    

This is simply unrealistic.  O_SYNC might force data to be on a platter 
when using a directly attached disk, but many NASes actually do writeback 
caching and rely on having a UPS to preserve data integrity.  
There's really no way in the general case to ensure that data is 
actually on a platter once you've involved a complex storage setup, 
unless you assume FUA.

Let me put it another way.  If an admin knows the disks on a machine 
have a battery-backed cache, he's likely to leave writeback caching enabled.

We are currently giving the admin two choices with QEMU: either ignore 
the fact that the disk is battery backed and do writethrough caching of 
the disk, or do writeback caching in the host, which expands the disk 
cache from something very small and non-volatile (the on-disk cache) to 
something very large and volatile (the page cache).  To make the page 
cache non-volatile, you would need a UPS for the hypervisor 
with enough power to flush the page cache.

So basically, we're not presenting a model that makes sensible use of 
reliable disks.

cache=none does the right thing here but doesn't benefit from the host's 
page cache for reads.  This is really the missing behavior.

Regards,

Anthony Liguori



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-17 14:46                     ` Christoph Hellwig
@ 2010-08-17 14:57                       ` Anthony Liguori
  2010-08-17 14:59                       ` Avi Kivity
  1 sibling, 0 replies; 23+ messages in thread
From: Anthony Liguori @ 2010-08-17 14:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Michael Tokarev, KVM list, Kevin Wolf

On 08/17/2010 09:46 AM, Christoph Hellwig wrote:
> On Tue, Aug 17, 2010 at 09:44:49AM -0500, Anthony Liguori wrote:
>    
>> I think the real issue is we're mixing host configuration with guest
>> visible state.
>>      
> The last time I proposed to decouple the two you and Avi were heavily
> opposed to it..
>
>    
>> With O_SYNC, we're causing cache=writethrough to do writethrough
>> through two layers of the storage hierarchy.  I don't think that's
>> necessary or desirable though.
>>      
> It's absolutely necessary if we tell the guest that we do not have
> a volatile write cache.  Which is the only good reason to use
> cache=writethrough anyway - except for dealing with old guests that
> can't handle a volatile write cache, it's an absolutely stupid mode of
> operation.
>    

You can lose an awful lot of data with cache=writeback because the host 
page cache is volatile.  In a perfect world, this would only be 
non-critical data because everyone would be using fsync() properly, but 
1) even non-critical data is important when there's a lot of it, and 2) we 
don't live in a perfect world.  The fact of the matter is, there are a 
huge number of crappy filesystems and applications today that don't 
submit barriers appropriately.

We make the situation much worse with virtualization because of the 
sheer size of the cache we introduce.

Regards,

Anthony Liguori


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-17 14:46                     ` Christoph Hellwig
  2010-08-17 14:57                       ` Anthony Liguori
@ 2010-08-17 14:59                       ` Avi Kivity
  2010-08-17 15:04                         ` Christoph Hellwig
  1 sibling, 1 reply; 23+ messages in thread
From: Avi Kivity @ 2010-08-17 14:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Anthony Liguori, Michael Tokarev, KVM list, Kevin Wolf

  On 08/17/2010 05:46 PM, Christoph Hellwig wrote:
> On Tue, Aug 17, 2010 at 09:44:49AM -0500, Anthony Liguori wrote:
>> I think the real issue is we're mixing host configuration with guest
>> visible state.
> The last time I proposed to decouple the two you and Avi were heavily
> opposed to it..

I wasn't that I can recall.

>> With O_SYNC, we're causing cache=writethrough to do writethrough
>> through two layers of the storage hierarchy.  I don't think that's
>> necessary or desirable though.
> It's absolutely necessary if we tell the guest that we do not have
> a volatile write cache.  Which is the only good reason to use
> cache=writethrough anyway - except for dealing with old guests that
> can't handle a volatile write cache, it's an absolutely stupid mode of
> operation.

I agree, but there's another case: tell the guest that we have a write 
cache, use O_DSYNC, but only flush the disk cache on guest flushes.

The reason for this is that if we don't use O_DSYNC, the page cache can 
grow to huge proportions.  While this is allowed by the contract between 
the virtual drive and the guest, guest software and users won't expect a 
huge data loss on power failure, only a minor data loss from the last 
fraction of a second before the failure.

I believe this can be approximated by mounting the host filesystem with 
barrier=0?
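For example, a host fstab entry along these lines - the device and mount point are hypothetical, and disabling barriers is only sane on battery-backed storage:

```
# hypothetical host fstab entry for the filesystem holding the images;
# barrier=0 drops the journal-commit cache flushes discussed above
/dev/sdb1  /var/lib/images  ext4  defaults,barrier=0  0  2
```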

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-17 14:54                     ` Anthony Liguori
@ 2010-08-17 15:01                       ` Avi Kivity
  2010-08-17 15:02                       ` Christoph Hellwig
  1 sibling, 0 replies; 23+ messages in thread
From: Avi Kivity @ 2010-08-17 15:01 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Christoph Hellwig, Michael Tokarev, KVM list, Kevin Wolf

  On 08/17/2010 05:54 PM, Anthony Liguori wrote:
>
> This is simply unrealistic.  O_SYNC might force data to be on a 
> platter when using a directly attached disk, but many NASes actually do 
> writeback caching and rely on having a UPS to preserve data 
> integrity.  There's really no way in the general case to ensure that 
> data is actually on a platter once you've involved a complex storage 
> setup, unless you assume FUA.

That's fine.  Memory backed up by a UPS is a disk platter as far as the 
user is concerned, if the NAS is reliable.

>
> Let me put it another way.  If an admin knows the disks on a machine 
> have battery backed cache, he's likely to leave writeback caching 
> enabled.

In this case, as far as the host is concerned, there is no cache.  Data 
written is guaranteed to reach the disk eventually even without a 
flush.  Hopefully the disk advertises itself as not having a volatile cache.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-17 14:54                     ` Anthony Liguori
  2010-08-17 15:01                       ` Avi Kivity
@ 2010-08-17 15:02                       ` Christoph Hellwig
  1 sibling, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2010-08-17 15:02 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Christoph Hellwig, Michael Tokarev, KVM list, Kevin Wolf

On Tue, Aug 17, 2010 at 09:54:07AM -0500, Anthony Liguori wrote:
> This is simply unrealistic.  O_SYNC might force data to be on a
> platter when using a directly attached disk, but many NASes actually
> do writeback caching and rely on having a UPS to preserve data
> integrity.  There's really no way in the general case to ensure that
> data is actually on a platter once you've involved a complex storage
> setup, unless you assume FUA.

Yes, there is.  If you have an array that has battery backup, it handles
this internally.  The normal case is to not set the WCE bit in the
caching mode page, which tells the operating system to never send
SYNCHRONIZE_CACHE commands.  I have one array that sets the WCE bit
nevertheless, but it also doesn't flush its non-volatile cache on
SYNCHRONIZE_CACHE, instead implementing it as an effective no-op.

> Let me put it another way.  If an admin knows the disks on a machine
> have battery backed cache, he's likely to leave writeback caching
> enabled.
> 
> We are currently giving the admin two choices with QEMU, either
> ignore the fact that the disk is battery backed and do write through
> caching of the disk or do writeback caching in the host which

Again, this is not qemu's business at all.  Qemu is no different from
any other application requiring data integrity.  If that admin really
thinks he needs to override the storage-provided settings he can
mount the filesystem using -o nobarrier and we will not send cache
flushes.  I would in general recommend against this, as an external
UPS still has lots of failure modes that this doesn't account for.
Arrays with internal non-volatile memory already do the right thing
anyway.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: JFYI: ext4 bug triggerable by kvm
  2010-08-17 14:59                       ` Avi Kivity
@ 2010-08-17 15:04                         ` Christoph Hellwig
  0 siblings, 0 replies; 23+ messages in thread
From: Christoph Hellwig @ 2010-08-17 15:04 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Christoph Hellwig, Anthony Liguori, Michael Tokarev, KVM list,
	Kevin Wolf

On Tue, Aug 17, 2010 at 05:59:07PM +0300, Avi Kivity wrote:
> I agree, but there's another case: tell the guest that we have a
> write cache, use O_DSYNC, but only flush the disk cache on guest
> flushes.

O_DSYNC flushes the disk write cache on any filesystem that supports
flushing the volatile write cache.   The disk cache is not an abstraction
exposed to applications.

> I believe this can be approximated by mounting the host filesystem
> with barrier=0?

Mounting the host filesystem with nobarrier means we will never
explicitly flush the volatile write cache on the disk.
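For reference, this is an ext4 mount option; barrier=0 (as spelled
earlier in the thread) is equivalent to nobarrier.  A configuration
sketch -- the device and mount point below are placeholders:

```shell
# Disable cache-flush (barrier) requests from ext4.  Only safe when the
# underlying storage has a non-volatile (e.g. battery-backed) write cache.
mount -o remount,nobarrier /var/lib/images

# Or persistently, in /etc/fstab:
#   /dev/sdb1  /var/lib/images  ext4  nobarrier  0  2
```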

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2010-08-17 15:04 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-16 14:00 JFYI: ext4 bug triggerable by kvm Michael Tokarev
2010-08-16 14:43 ` Anthony Liguori
2010-08-16 18:42   ` Christoph Hellwig
2010-08-16 20:34     ` Anthony Liguori
2010-08-17  9:07       ` Christoph Hellwig
2010-08-17  9:23         ` Avi Kivity
2010-08-17 11:17           ` Christoph Hellwig
2010-08-17 12:56         ` Anthony Liguori
2010-08-17 13:07           ` Christoph Hellwig
2010-08-17 14:20             ` Anthony Liguori
2010-08-17 14:28               ` Christoph Hellwig
2010-08-17 14:39                 ` Anthony Liguori
2010-08-17 14:45                   ` Christoph Hellwig
2010-08-17 14:53                     ` Avi Kivity
2010-08-17 14:54                     ` Anthony Liguori
2010-08-17 15:01                       ` Avi Kivity
2010-08-17 15:02                       ` Christoph Hellwig
2010-08-17 14:40                 ` Michael Tokarev
2010-08-17 14:44                   ` Anthony Liguori
2010-08-17 14:46                     ` Christoph Hellwig
2010-08-17 14:57                       ` Anthony Liguori
2010-08-17 14:59                       ` Avi Kivity
2010-08-17 15:04                         ` Christoph Hellwig
