* [Qemu-devel] [RFC] Disk integrity in QEMU

From: Anthony Liguori @ 2008-10-09 17:00 UTC
To: qemu-devel@nongnu.org
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper

Hi,

There's been a lot of discussion recently, mostly in other places, about disk integrity and performance in QEMU. I must admit, my own thinking has changed pretty recently in this space. I wanted to try and focus the conversation on qemu-devel so that we could get everyone involved and come up with a plan for the future.

Right now, QEMU can open a file in two ways. It can open it without any special caching flags (the default) or it can open it O_DIRECT. O_DIRECT implies that the IO does not go through the host page cache. This is controlled with cache=on and cache=off respectively.

When cache=on, read requests may not actually go to the disk. If a previous read request (by some application on the system) has read the same data, then it becomes a simple memcpy(). Also, the host IO scheduler may do read-ahead, which means that the data may be available from that. In general, the host knows the most about the underlying disk system and the total IO load on the system, so it is far better suited to optimize these sorts of things than the guest.

Write requests end up being simple memcpy()s too, as the data is just copied into the page cache and the page is scheduled to be eventually written to disk. Since we don't know when the data is actually written to disk, we tell the guest the data is written before it actually is. If you assume that the host is stable, then there isn't an integrity issue. This assumes that you have backup power and that the host OS has no bugs. It's not a totally unreasonable assumption, but for a large number of users, it's not a good assumption.

A side effect of cache=off is that data integrity only depends on the integrity of your storage system (which isn't always safe, btw), which is probably closer to what most users expect. There are many other side effects, though.

An alternative to cache=off that addresses the data integrity problem directly is to open all disk images with O_DSYNC. This will still use the host page cache (and therefore get all the benefits of it) but will only signal write completion when the data is actually written to disk. The effect of this is to make the integrity of the VM equal the integrity of the storage system (no longer relying on the host). By still going through the page cache, you still get the benefits of the host's IO scheduler and read-ahead.

The only place performance is affected is writes (reads are equivalent). If you run a write benchmark in a guest today, you'll see a number that is higher than native. The implication here is that data integrity is not being maintained if you don't trust the host. O_DSYNC takes care of this.

Read performance should be unaffected by using O_DSYNC. O_DIRECT will significantly reduce read performance. I think we should use O_DSYNC by default and I have sent out a patch that contains that. We will follow up with benchmarks to demonstrate this.

There are certain benefits to using O_DIRECT. One argument for using O_DIRECT is that going through the host page cache requires allocating host memory to perform IO. If you are not sharing data between guests, and the guest has a relatively large amount of memory compared to the host, and you have a simple disk in the host, going through the host page cache wastes some memory that could be used to cache other IO operations on the system. I don't really think this is the typical case, so I don't think this is an argument for having it on by default. However, it can be enabled if you know this is going to be the case.

The biggest benefit to using O_DIRECT is that you can potentially avoid ever bringing data into the CPU's cache. Once data is cached, copying it is relatively cheap. If you're never going to touch the data (think disk DMA => nic DMA via sendfile()), then avoiding the CPU cache can be a big win. Again, I don't think this is the common case, but the option is there in case it's suitable. An important point is that today we always copy data internally in QEMU, which means that, practically speaking, you'll never see this benefit.

So to summarize, I think we should enable O_DSYNC by default to ensure that guest data integrity is not dependent on the host OS, and that practically speaking, cache=off is only useful for very specialized circumstances. Part of the patch I'll follow up with includes changes to the man page to document all of this for users.

Thoughts?

Regards,

Anthony Liguori

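For reference, a minimal sketch of how the three policies discussed above map onto open(2) flags on Linux. The flag names are real; the enum and helper function are purely illustrative and are not QEMU's block-layer code.

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>

enum cache_mode { CACHE_WRITEBACK, CACHE_WRITETHROUGH, CACHE_NONE };

int open_disk_image(const char *path, enum cache_mode mode)
{
    int flags = O_RDWR;

    switch (mode) {
    case CACHE_WRITETHROUGH:
        flags |= O_DSYNC;   /* completion reported only once the host
                               storage subsystem acknowledges the write */
        break;
    case CACHE_NONE:
        flags |= O_DIRECT;  /* bypass the host page cache entirely */
        break;
    case CACHE_WRITEBACK:
    default:
        break;              /* default: write-back via the host page cache */
    }
    return open(path, flags);
}
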
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Gerd Hoffmann @ 2008-10-10  7:54 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

  Hi,

> Read performance should be unaffected by using O_DSYNC. O_DIRECT will
> significantly reduce read performance. I think we should use O_DSYNC by
> default and I have sent out a patch that contains that. We will follow
> up with benchmarks to demonstrate this.

So O_SYNC on/off is pretty much equivalent to disk write caching being on/off, right? So we could make that guest-controlled, i.e. toggling write caching in the guest (using hdparm) toggles O_SYNC in qemu? This together with disk-flush command support (mapping to fsync on the host) should allow guests to go into barrier mode for better write performance without losing data integrity.

cheers,
  Gerd

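To make that mapping concrete, here is a rough sketch in plain C of what Gerd's proposal could look like on the host side. The function names are made up for illustration and this is not qemu's actual IDE emulation code.

#include <fcntl.h>
#include <unistd.h>

/* Called when the guest toggles the ATA write-cache-enable bit
 * (hdparm -W0 / -W1).  Reopen the backing image with or without
 * O_DSYNC and return the new descriptor. */
int ide_set_write_cache(const char *image_path, int old_fd, int wce)
{
    int flags = O_RDWR | (wce ? 0 : O_DSYNC);
    int new_fd = open(image_path, flags);

    if (new_fd < 0)
        return old_fd;      /* keep the old descriptor on failure */
    close(old_fd);
    return new_fd;
}

/* Called when the guest issues ATA FLUSH CACHE (e.g. for a barrier). */
int ide_flush_cache(int image_fd)
{
    return fsync(image_fd);
}
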
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Mark McLoughlin @ 2008-10-10  8:12 UTC
To: Gerd Hoffmann
Cc: Chris Wright, kvm-devel, Ryan Harper, qemu-devel, Laurent Vivier

On Fri, 2008-10-10 at 09:54 +0200, Gerd Hoffmann wrote:
> Hi,
>
> > Read performance should be unaffected by using O_DSYNC. O_DIRECT will
> > significantly reduce read performance. I think we should use O_DSYNC by
> > default and I have sent out a patch that contains that. We will follow
> > up with benchmarks to demonstrate this.
>
> So O_SYNC on/off is pretty much equivalent to disk write caching being
> on/off, right? So we could make that guest-controlled, i.e. toggling
> write caching in the guest (using hdparm) toggles O_SYNC in qemu?

I don't think it's correct to equate disk write caching to completing guest writes when the data has been copied to the host's page cache. The host's page cache will cache much more data for much longer than a typical disk, right?

If so, then this form of write caching is much more likely to result in fs corruption if the host crashes. In that case, all qemu users would really need to disable write caching in the guest using hdparm, which they don't need to do on bare-metal.

Cheers,
Mark.

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Jamie Lokier @ 2008-10-12 23:10 UTC
To: Mark McLoughlin, qemu-devel
Cc: Chris Wright, kvm-devel, Ryan Harper, Gerd Hoffmann, Laurent Vivier

Mark McLoughlin wrote:
> > So O_SYNC on/off is pretty much equivalent to disk write caching being
> > on/off, right? So we could make that guest-controlled, i.e. toggling
> > write caching in the guest (using hdparm) toggles O_SYNC in qemu?
>
> I don't think it's correct to equate disk write caching to completing
> guest writes when the data has been copied to the host's page cache. The
> host's page cache will cache much more data for much longer than a
> typical disk, right?
>
> If so, then this form of write caching is much more likely to result in
> fs corruption if the host crashes. In that case, all qemu users would
> really need to disable write caching in the guest using hdparm, which
> they don't need to do on bare-metal.

However, should the effect of the guest turning off the IDE disk write cache perhaps be identical to the guest issuing IDE cache flush commands following every IDE write?

This could mean the host calling fdatasync, or fsync, or using O_DSYNC, or O_DIRECT - whatever the host does for IDE flush cache.

What this means _exactly_ for data integrity is outside of qemu's control and is a user & host configuration issue. But qemu could provide consistency at least.

-- Jamie

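A sketch of Jamie's per-write variant, again with hypothetical function names rather than qemu's real block API: when the emulated drive's write cache is disabled, behave as if every completed write were followed by a flush.

#define _XOPEN_SOURCE 500       /* for pwrite()/fdatasync() */
#include <sys/types.h>
#include <unistd.h>

/* wce: current state of the emulated drive's write-cache-enable bit */
ssize_t ide_write_sectors(int image_fd, const void *buf, size_t len,
                          off_t offset, int wce)
{
    ssize_t ret = pwrite(image_fd, buf, len, offset);

    if (ret >= 0 && !wce)
        fdatasync(image_fd);    /* cache off: don't report completion
                                   until the host has flushed the data */
    return ret;
}
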
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Avi Kivity @ 2008-10-14 17:15 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper, Gerd Hoffmann

Jamie Lokier wrote:
> However, should the effect of the guest turning off the IDE disk write
> cache perhaps be identical to the guest issuing IDE cache flush commands
> following every IDE write?
>
> This could mean the host calling fdatasync, or fsync, or using
> O_DSYNC, or O_DIRECT - whatever the host does for IDE flush cache.
>
> What this means _exactly_ for data integrity is outside of qemu's
> control and is a user & host configuration issue. But qemu could
> provide consistency at least.

We should completely ignore the guest IDE write cache. It was brought into life by the deficiencies of IDE which presented the user with an impossible tradeoff -- you can choose between data loss and horrible performance.

Since modern hardware doesn't require this tradeoff, there is no reason to force the user to make these choices.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Avi Kivity @ 2008-10-10  9:32 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier

Gerd Hoffmann wrote:
> Hi,
>
> > Read performance should be unaffected by using O_DSYNC. O_DIRECT will
> > significantly reduce read performance. I think we should use O_DSYNC by
> > default and I have sent out a patch that contains that. We will follow
> > up with benchmarks to demonstrate this.
>
> So O_SYNC on/off is pretty much equivalent to disk write caching being
> on/off, right? So we could make that guest-controlled, i.e. toggling
> write caching in the guest (using hdparm) toggles O_SYNC in qemu? This
> together with disk-flush command support (mapping to fsync on the host)
> should allow guests to go into barrier mode for better write performance
> without losing data integrity.

IDE write caching is very different from host write caching.

The IDE write cache is not susceptible to software failures (well, it is susceptible to firmware failures, but let's ignore that). It is likely to survive reset and perhaps even powerdown. The risk window is a few megabytes and tens of milliseconds long.

The host pagecache will not survive software failures, resets, or powerdown. The risk window is hundreds of megabytes and thousands of milliseconds long.

It's perfectly normal to leave a production system on IDE (though perhaps not a mission-critical database), but totally mad to do so with host caching.

I don't think we should tie data integrity to an IDE misfeature that doesn't even exist anymore (with the advent of SATA NCQ).

-- 
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Jamie Lokier @ 2008-10-12 23:00 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

Avi Kivity wrote:
> The IDE write cache is not susceptible to software failures (well it is
> susceptible to firmware failures, but let's ignore that). It is likely
> to survive reset and perhaps even powerdown. The risk window is a few
> megabytes and tens of milliseconds long.

Nonetheless, from yanking the power relatively often while using ext3 (this is on a host only, no qemu involved) I've seen a number of corruption cases, and these all went away when the IDE write cache was disabled, or when IDE write barriers were used.

This is a failure case which happens in real life, but not often if you don't often yank the power during writes. Just so you know.

-- Jamie

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Aurelien Jarno @ 2008-10-10  8:11 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

On Thu, Oct 09, 2008 at 12:00:41PM -0500, Anthony Liguori wrote:

[snip]

> So to summarize, I think we should enable O_DSYNC by default to ensure
> that guest data integrity is not dependent on the host OS, and that
> practically speaking, cache=off is only useful for very specialized
> circumstances. Part of the patch I'll follow up with includes changes
> to the man page to document all of this for users.
>
> Thoughts?

While I agree O_DSYNC should be the default, I wonder if we should keep the current behaviour available for those who want it. We can imagine the following options:

  cache=off    O_DIRECT
  cache=read   O_DSYNC (default)
  cache=on     0

-- 
  .''`.  Aurelien Jarno              | GPG: 1024D/F1BCDB73
 : :' :  Debian developer            | Electrical Engineer
 `. `'   aurel32@debian.org          | aurelien@aurel32.net
   `-    people.debian.org/~aurel32  | www.aurel32.net

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Anthony Liguori @ 2008-10-10 12:26 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier

Aurelien Jarno wrote:
> On Thu, Oct 09, 2008 at 12:00:41PM -0500, Anthony Liguori wrote:
>
> [snip]
>
>> So to summarize, I think we should enable O_DSYNC by default to ensure
>> that guest data integrity is not dependent on the host OS, and that
>> practically speaking, cache=off is only useful for very specialized
>> circumstances. Part of the patch I'll follow up with includes changes
>> to the man page to document all of this for users.
>>
>> Thoughts?
>
> While I agree O_DSYNC should be the default, I wonder if we should keep
> the current behaviour available for those who want it. We can imagine
> the following options:
>   cache=off    O_DIRECT
>   cache=read   O_DSYNC (default)
>   cache=on     0

Or maybe cache=off, cache=on, cache=wb. So that the default would be cache=on, which is write-through, or the user can choose write-back caching.

But that said, I'm concerned that this is far too confusing for users. I don't think anyone is relying on disk write performance when in write-back mode simply because the guest already has a page cache, so writes are already being completed instantaneously from the application's perspective.

Regards,

Anthony Liguori

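For illustration, the option-to-flag mapping being discussed could look roughly like this. The option names (cache=off/on/wb vs. cache=off/read/on) were still being debated at this point in the thread, so treat both the names and the helper as placeholders rather than the final command-line interface.

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <string.h>

int cache_option_to_open_flags(const char *value)
{
    if (strcmp(value, "off") == 0)
        return O_DIRECT;        /* no host page cache at all */
    if (strcmp(value, "wb") == 0)
        return 0;               /* write-back through the host page cache */
    /* "on" (the proposed default): write-through via the host page cache */
    return O_DSYNC;
}
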
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Paul Brook @ 2008-10-10 12:53 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper

> But that said, I'm concerned that this is far too confusing for users.
> I don't think anyone is relying on disk write performance when in
> write-back mode simply because the guest already has a page cache so
> writes are already being completed instantaneously from the
> application's perspective.

This isn't entirely true. With IDE devices you don't have command queueing, so it's easy for a large write to stall subsequent reads for a relatively long time. I'm not sure how much this affects qemu, but I've definitely seen it happening on real hardware.

Paul

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Anthony Liguori @ 2008-10-10 13:55 UTC
To: Paul Brook
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper

Paul Brook wrote:
>> But that said, I'm concerned that this is far too confusing for users.
>> I don't think anyone is relying on disk write performance when in
>> write-back mode simply because the guest already has a page cache so
>> writes are already being completed instantaneously from the
>> application's perspective.
>
> This isn't entirely true. With IDE devices you don't have command queueing, so
> it's easy for a large write to stall subsequent reads for a relatively long
> time.
> I'm not sure how much this affects qemu, but I've definitely seen it happening
> on real hardware.

I think that suggests we should have a cache=wb option and if people report slowdowns with IDE, we can observe if cache=wb helps. My suspicion is that it's not going to have a practical impact because as long as the operations are asynchronous (via DMA), then you're getting native-like performance.

My bigger concern is synchronous IO operations because then a guest VCPU is getting far less time to run and that may have a cascading effect on performance.

Anyway, I'll work up a new patch with cache=wb and repost.

Regards,

Anthony Liguori

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Paul Brook @ 2008-10-10 14:05 UTC
To: Anthony Liguori
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper

On Friday 10 October 2008, Anthony Liguori wrote:
> Paul Brook wrote:
> >> But that said, I'm concerned that this is far too confusing for users.
> >> I don't think anyone is relying on disk write performance when in
> >> write-back mode simply because the guest already has a page cache so
> >> writes are already being completed instantaneously from the
> >> application's perspective.
> >
> > This isn't entirely true. With IDE devices you don't have command
> > queueing, so it's easy for a large write to stall subsequent reads for a
> > relatively long time.
> > I'm not sure how much this affects qemu, but I've definitely seen it
> > happening on real hardware.
>
> I think that suggests we should have a cache=wb option and if people
> report slowdowns with IDE, we can observe if cache=wb helps. My
> suspicion is that it's not going to have a practical impact because as
> long as the operations are asynchronous (via DMA), then you're getting
> native-like performance.

Sounds reasonable to me.

Paul

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Avi Kivity @ 2008-10-10 14:19 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper, Paul Brook

Anthony Liguori wrote:
>> This isn't entirely true. With IDE devices you don't have command
>> queueing, so it's easy for a large write to stall subsequent reads
>> for a relatively long time.
>> I'm not sure how much this affects qemu, but I've definitely seen it
>> happening on real hardware.
>
> I think that suggests we should have a cache=wb option and if people
> report slowdowns with IDE, we can observe if cache=wb helps. My
> suspicion is that it's not going to have a practical impact because as
> long as the operations are asynchronous (via DMA), then you're getting
> native-like performance.
>
> My bigger concern is synchronous IO operations because then a guest
> VCPU is getting far less time to run and that may have a cascading
> effect on performance.

IDE is limited to 256 sectors per transaction, or 128KB. If a sync transaction takes 5 ms, then your write rate is limited to 25 MB/sec. It's much worse if you're allocating qcow2 data, so each transaction is several sync writes.

Fabrice's point also holds: if the guest is issuing many write transactions for some reason, you don't want them hammering the disk and killing your desktop performance if you're just developing, say, a new filesystem.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

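A quick check of the arithmetic behind those figures:

  256 sectors x 512 bytes = 128 KiB per transaction
  128 KiB / 5 ms = roughly 25.6 MB/s, hence the "25 MB/sec" ceiling above

(The LBA48 limit raised in the follow-up below is 65536 sectors x 512 bytes = 32 MiB per transaction.)
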
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Jens Axboe @ 2008-10-17 13:14 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper, Paul Brook

On Fri, Oct 10 2008, Avi Kivity wrote:
> Anthony Liguori wrote:
> >> This isn't entirely true. With IDE devices you don't have command
> >> queueing, so it's easy for a large write to stall subsequent reads
> >> for a relatively long time.
> >> I'm not sure how much this affects qemu, but I've definitely seen it
> >> happening on real hardware.
> >
> > I think that suggests we should have a cache=wb option and if people
> > report slowdowns with IDE, we can observe if cache=wb helps. My
> > suspicion is that it's not going to have a practical impact because as
> > long as the operations are asynchronous (via DMA), then you're getting
> > native-like performance.
> >
> > My bigger concern is synchronous IO operations because then a guest
> > VCPU is getting far less time to run and that may have a cascading
> > effect on performance.
>
> IDE is limited to 256 sectors per transaction, or 128KB. If a sync
> transaction takes 5 ms, then your write rate is limited to 25 MB/sec.
> It's much worse if you're allocating qcow2 data, so each transaction is
> several sync writes.

No it isn't, even most IDE drives support lba48 which raises that limit to 64K sectors, or 32MB.

-- 
Jens Axboe

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Avi Kivity @ 2008-10-19  9:13 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper, Paul Brook

Jens Axboe wrote:
>> IDE is limited to 256 sectors per transaction, or 128KB. If a sync
>> transaction takes 5 ms, then your write rate is limited to 25 MB/sec.
>> It's much worse if you're allocating qcow2 data, so each transaction is
>> several sync writes.
>
> No it isn't, even most IDE drives support lba48 which raises that limit
> to 64K sectors, or 32MB.

Right, and qemu even supports this. Thanks for the correction.

-- 
error compiling committee.c: too many arguments to function

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Aurelien Jarno @ 2008-10-10 15:48 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

On Fri, Oct 10, 2008 at 07:26:00AM -0500, Anthony Liguori wrote:
> Aurelien Jarno wrote:
>> On Thu, Oct 09, 2008 at 12:00:41PM -0500, Anthony Liguori wrote:
>>
>> [snip]
>>
>>> So to summarize, I think we should enable O_DSYNC by default to
>>> ensure that guest data integrity is not dependent on the host OS,
>>> and that practically speaking, cache=off is only useful for very
>>> specialized circumstances. Part of the patch I'll follow up with
>>> includes changes to the man page to document all of this for users.
>>>
>>> Thoughts?
>>
>> While I agree O_DSYNC should be the default, I wonder if we should keep
>> the current behaviour available for those who want it. We can imagine
>> the following options:
>>   cache=off    O_DIRECT
>>   cache=read   O_DSYNC (default)
>>   cache=on     0
>
> Or maybe cache=off, cache=on, cache=wb. So that the default would be
> cache=on, which is write-through, or the user can choose write-back
> caching.
>
> But that said, I'm concerned that this is far too confusing for users.
> I don't think anyone is relying on disk write performance when in
> write-back mode simply because the guest already has a page cache so
> writes are already being completed instantaneously from the
> application's perspective.

Some of my setups rely on host cache. I am using a swap partition for some guests in order to increase the available "memory" (some platforms in qemu are limited to 256MB of RAM), and in that case I don't care about data integrity.

-- 
  .''`.  Aurelien Jarno              | GPG: 1024D/F1BCDB73
 : :' :  Debian developer            | Electrical Engineer
 `. `'   aurel32@debian.org          | aurelien@aurel32.net
   `-    people.debian.org/~aurel32  | www.aurel32.net

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Avi Kivity @ 2008-10-10  9:16 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

Anthony Liguori wrote:

[O_DSYNC, O_DIRECT, and 0]

> Thoughts?

There are (at least) three usage models for qemu:

- OS development tool
- casual or client-side virtualization
- server partitioning

The last two uses are almost always in conjunction with a hypervisor.

When using qemu as an OS development tool, data integrity is not very important. On the other hand, performance and caching are, especially as the guest is likely to be restarted multiple times, so the guest page cache is of limited value. For this use model the current default (write-back cache) is fine.

The 'casual virtualization' use is when the user has a full native desktop, and is also running another operating system. In this case, the host page cache is likely to be larger than the guest page cache. Data integrity is important, so write-back is out of the picture. I guess for this use case O_DSYNC is preferred, though O_DIRECT might not be significantly slower for long-running guests. This is because reads are unlikely to be cached and writes will not benefit much from the host pagecache.

For server partitioning, data integrity and performance are critical. The host page cache is significantly smaller than the guest page cache; if you have spare memory, give it to your guests. O_DIRECT is practically mandated here; the host page cache does nothing except to impose an additional copy.

Given the rather small difference between O_DSYNC and O_DIRECT, I favor not adding O_DSYNC as it will add only marginal value.

Regarding choosing the default value, I think we should change the default to be safe, that is O_DIRECT. If that is regarded as too radical, the default should be O_DSYNC with options to change it to O_DIRECT or writeback. Note that some disk formats will need updating, like qcow2, if they are not to have abysmal performance.

-- 
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Daniel P. Berrange @ 2008-10-10  9:58 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier

On Fri, Oct 10, 2008 at 11:16:05AM +0200, Avi Kivity wrote:
> Anthony Liguori wrote:
>
> [O_DSYNC, O_DIRECT, and 0]
>
> > Thoughts?
>
> There are (at least) three usage models for qemu:
>
> - OS development tool
> - casual or client-side virtualization
> - server partitioning
>
> The last two uses are almost always in conjunction with a hypervisor.
>
> When using qemu as an OS development tool, data integrity is not very
> important. On the other hand, performance and caching are, especially
> as the guest is likely to be restarted multiple times so the guest page
> cache is of limited value. For this use model the current default
> (write-back cache) is fine.

It is a myth that developers don't care about data consistency / crash safety. I've lost countless guest VMs to corruption when my host OS crashed & it's just a waste of my time. Given the choice between likely-to-corrupt and not-likely-to-corrupt, even developers will want the latter.

> The 'casual virtualization' use is when the user has a full native
> desktop, and is also running another operating system. In this case,
> the host page cache is likely to be larger than the guest page cache.
> Data integrity is important, so write-back is out of the picture. I
> guess for this use case O_DSYNC is preferred though O_DIRECT might not
> be significantly slower for long-running guests. This is because reads
> are unlikely to be cached and writes will not benefit much from the host
> pagecache.
>
> For server partitioning, data integrity and performance are critical.
> The host page cache is significantly smaller than the guest page cache;
> if you have spare memory, give it to your guests. O_DIRECT is
> practically mandated here; the host page cache does nothing except to
> impose an additional copy.
>
> Given the rather small difference between O_DSYNC and O_DIRECT, I favor
> not adding O_DSYNC as it will add only marginal value.
>
> Regarding choosing the default value, I think we should change the
> default to be safe, that is O_DIRECT. If that is regarded as too
> radical, the default should be O_DSYNC with options to change it to
> O_DIRECT or writeback. Note that some disk formats will need updating
> like qcow2 if they are not to have abysmal performance.

Absolutely agree that the default should be safe. I don't have enough knowledge to say whether O_DIRECT/O_DSYNC is best - which also implies we should choose the best setting by default, because we can't expect users to know the tradeoffs either.

Daniel

-- 
|: Red Hat, Engineering, London  -o-  http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org  -o-  http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Avi Kivity @ 2008-10-10 10:26 UTC
To: Daniel P. Berrange, qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

Daniel P. Berrange wrote:
>> There are (at least) three usage models for qemu:
>>
>> - OS development tool
>> - casual or client-side virtualization
>> - server partitioning
>>
>> The last two uses are almost always in conjunction with a hypervisor.
>>
>> When using qemu as an OS development tool, data integrity is not very
>> important. On the other hand, performance and caching are, especially
>> as the guest is likely to be restarted multiple times so the guest page
>> cache is of limited value. For this use model the current default
>> (write-back cache) is fine.
>
> It is a myth that developers don't care about data consistency / crash
> safety. I've lost countless guest VMs to corruption when my host OS
> crashed & it's just a waste of my time. Given the choice between
> likely-to-corrupt and not-likely-to-corrupt, even developers will
> want the latter.

There are other data integrity solutions for developers, like backups (unlikely, I know) or -snapshot.

> Absolutely agree that the default should be safe. I don't have enough
> knowledge to say whether O_DIRECT/O_DSYNC is best - which also implies
> we should choose the best setting by default, because we can't expect
> users to know the tradeoffs either.

The fact that there are different use models for qemu implies that the default must be chosen at some higher level than qemu code itself. It might be done using /etc/qemu or ~/.qemu, or at the management interface, but there is no best setting for qemu itself.

-- 
I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain.

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Paul Brook @ 2008-10-10 12:59 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper, Avi Kivity

>> - OS development tool
>> - casual or client-side virtualization
>> - server partitioning

> > Absolutely agree that the default should be safe. I don't have enough
> > knowledge to say whether O_DIRECT/O_DSYNC is best - which also implies
> > we should choose the best setting by default, because we can't expect
> > users to know the tradeoffs either.
>
> The fact that there are different use models for qemu implies that the
> default must be chosen at some higher level than qemu code itself. It
> might be done using /etc/qemu or ~/.qemu, or at the management
> interface, but there is no best setting for qemu itself.

This suggests that the most appropriate defaults are for the users that are least likely to be using a management tool. I'd guess that the server partitioning folks are most likely to be using a management tool, so qemu defaults should be set up for casual/development use. I don't have hard data to back this up though.

Paul

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Avi Kivity @ 2008-10-10 13:20 UTC
To: Paul Brook
Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper

Paul Brook wrote:
>>> Absolutely agree that the default should be safe. I don't have enough
>>> knowledge to say whether O_DIRECT/O_DSYNC is best - which also implies
>>> we should choose the best setting by default, because we can't expect
>>> users to know the tradeoffs either.
>>
>> The fact that there are different use models for qemu implies that the
>> default must be chosen at some higher level than qemu code itself. It
>> might be done using /etc/qemu or ~/.qemu, or at the management
>> interface, but there is no best setting for qemu itself.
>
> This suggests that the most appropriate defaults are for the users that are
> least likely to be using a management tool. I'd guess that the server
> partitioning folks are most likely to be using a management tool, so qemu
> defaults should be set up for casual/development use. I don't have hard data
> to back this up though.

I agree (as my own uses are of the development kind). That rules out O_DIRECT as the qemu-level default. However I'm not sure writeback is a good default, it's too risky (though I've never been bitten; and I've had my share of host crashes).

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Anthony Liguori @ 2008-10-10 12:34 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier

Avi Kivity wrote:
> Anthony Liguori wrote:
>
> [O_DSYNC, O_DIRECT, and 0]
>
>> Thoughts?
>
> There are (at least) three usage models for qemu:
>
> - OS development tool
> - casual or client-side virtualization
> - server partitioning
>
> The last two uses are almost always in conjunction with a hypervisor.
>
> When using qemu as an OS development tool, data integrity is not very
> important. On the other hand, performance and caching are, especially
> as the guest is likely to be restarted multiple times so the guest
> page cache is of limited value. For this use model the current
> default (write-back cache) is fine.
>
> The 'casual virtualization' use is when the user has a full native
> desktop, and is also running another operating system. In this case,
> the host page cache is likely to be larger than the guest page cache.
> Data integrity is important, so write-back is out of the picture. I
> guess for this use case O_DSYNC is preferred though O_DIRECT might not
> be significantly slower for long-running guests. This is because
> reads are unlikely to be cached and writes will not benefit much from
> the host pagecache.
>
> For server partitioning, data integrity and performance are critical.
> The host page cache is significantly smaller than the guest page
> cache; if you have spare memory, give it to your guests.

I don't think this wisdom is bullet-proof. In the case of server partitioning, if you're designing for the future then you can assume some form of host data deduplication, either through qcow deduplication, a proper content addressable storage mechanism, or file system level deduplication. It's becoming more common to see large amounts of homogeneous consolidation, either because of cloud computing, virtual appliances, or just because most x86 virtualization involves Windows consolidation and there aren't that many versions of Windows.

In this case, there is an awful lot of opportunity for increasing overall system throughput by caching common data access across virtual machines.

> O_DIRECT is practically mandated here; the host page cache does
> nothing except to impose an additional copy.
>
> Given the rather small difference between O_DSYNC and O_DIRECT, I
> favor not adding O_DSYNC as it will add only marginal value.

The difference isn't small. Our fio runs are defeating the host page cache on write so we're adjusting the working set size. But the difference in read performance between dsync and direct is many factors when the data can be cached.

> Regarding choosing the default value, I think we should change the
> default to be safe, that is O_DIRECT. If that is regarded as too
> radical, the default should be O_DSYNC with options to change it to
> O_DIRECT or writeback. Note that some disk formats will need updating
> like qcow2 if they are not to have abysmal performance.

I think qcow2 will be okay because the only issue is image expansion and that is a relatively uncommon case that is amortized throughout the lifetime of the VM.

So far, while there is objection to using O_DIRECT by default, I haven't seen any objection to O_DSYNC by default, so as long as no one objects in the next few days, I think that's what we'll end up doing.

Regards,

Anthony Liguori

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Avi Kivity @ 2008-10-10 12:56 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

Anthony Liguori wrote:
>> For server partitioning, data integrity and performance are
>> critical. The host page cache is significantly smaller than the
>> guest page cache; if you have spare memory, give it to your guests.
>
> I don't think this wisdom is bullet-proof. In the case of server
> partitioning, if you're designing for the future then you can assume
> some form of host data deduplication, either through qcow
> deduplication, a proper content addressable storage mechanism, or
> file system level deduplication. It's becoming more common to see
> large amounts of homogeneous consolidation, either because of cloud
> computing, virtual appliances, or just because most x86 virtualization
> involves Windows consolidation and there aren't that many versions of
> Windows.
>
> In this case, there is an awful lot of opportunity for increasing
> overall system throughput by caching common data access across virtual
> machines.

That's true. But is the OS image a significant source of I/O in a running system? My guess is that it is not.

In any case, deduplication is far enough into the future to not attempt to solve it now. The solution may be part of the deduplication solution itself, for example it may choose to cache shared data (since they are read-only anyway) even with O_DIRECT.

>> O_DIRECT is practically mandated here; the host page cache does
>> nothing except to impose an additional copy.
>>
>> Given the rather small difference between O_DSYNC and O_DIRECT, I
>> favor not adding O_DSYNC as it will add only marginal value.
>
> The difference isn't small. Our fio runs are defeating the host page
> cache on write so we're adjusting the working set size. But the
> difference in read performance between dsync and direct is many
> factors when the data can be cached.

That's because you're leaving host memory idle. That's not a realistic scenario. What happens if you assign free host memory to the guest?

>> Regarding choosing the default value, I think we should change the
>> default to be safe, that is O_DIRECT. If that is regarded as too
>> radical, the default should be O_DSYNC with options to change it to
>> O_DIRECT or writeback. Note that some disk formats will need
>> updating like qcow2 if they are not to have abysmal performance.
>
> I think qcow2 will be okay because the only issue is image expansion
> and that is a relatively uncommon case that is amortized throughout
> the lifetime of the VM. So far, while there is objection to using
> O_DIRECT by default, I haven't seen any objection to O_DSYNC by
> default, so as long as no one objects in the next few days, I think
> that's what we'll end up doing.

I don't mind that as long as there is a way to request O_DIRECT (which I think is cache=off under your proposal).

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: andrzej zaborowski @ 2008-10-11  9:07 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

2008/10/10 Anthony Liguori <anthony@codemonkey.ws>:
> I think qcow2 will be okay because the only issue is image expansion and
> that is a relatively uncommon case that is amortized throughout the life
> time of the VM.

It's debatable how common this is and whether you can count on the amortization. I'd say that for most users creating new short-lived VMs is the bigger slice of their time using qemu. For example, think about trying out different distros like with free.oszoo.org; most images there are qcow2. Similarly, trying to install an OS and booting its kernel with different options in sequence is where waiting is most annoying. Also -snapshot uses qcow2.

In any case let's have benchmarks before deciding anything about changing the default behavior. Since about 0.9.0 qemu has been going through a lot of (necessary) changes that were in great part slowdowns, and they really accumulated.

Regards

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Mark Wagner @ 2008-10-11 17:54 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier

Avi Kivity wrote:
> Anthony Liguori wrote:
>
> [O_DSYNC, O_DIRECT, and 0]
>
>> Thoughts?
>
> There are (at least) three usage models for qemu:
>
> - OS development tool
> - casual or client-side virtualization
> - server partitioning
>
> The last two uses are almost always in conjunction with a hypervisor.
>
> When using qemu as an OS development tool, data integrity is not very
> important. On the other hand, performance and caching are, especially
> as the guest is likely to be restarted multiple times so the guest page
> cache is of limited value. For this use model the current default
> (write-back cache) is fine.
>
> The 'casual virtualization' use is when the user has a full native
> desktop, and is also running another operating system. In this case,
> the host page cache is likely to be larger than the guest page cache.
> Data integrity is important, so write-back is out of the picture. I
> guess for this use case O_DSYNC is preferred though O_DIRECT might not
> be significantly slower for long-running guests. This is because reads
> are unlikely to be cached and writes will not benefit much from the host
> pagecache.
>
> For server partitioning, data integrity and performance are critical.
> The host page cache is significantly smaller than the guest page cache;
> if you have spare memory, give it to your guests. O_DIRECT is
> practically mandated here; the host page cache does nothing except to
> impose an additional copy.
>
> Given the rather small difference between O_DSYNC and O_DIRECT, I favor
> not adding O_DSYNC as it will add only marginal value.
>
> Regarding choosing the default value, I think we should change the
> default to be safe, that is O_DIRECT. If that is regarded as too
> radical, the default should be O_DSYNC with options to change it to
> O_DIRECT or writeback. Note that some disk formats will need updating
> like qcow2 if they are not to have abysmal performance.

I think one of the main things to be considered is the integrity of the actual system call. The Linux manpage for open() states the following about the use of the O_DIRECT flag:

  O_DIRECT (Since Linux 2.6.10)
      Try to minimize cache effects of the I/O to and from this file. In
      general this will degrade performance, but it is useful in special
      situations, such as when applications do their own caching. File
      I/O is done directly to/from user space buffers. The I/O is
      synchronous, that is, at the completion of a read(2) or write(2),
      data is guaranteed to have been transferred. Under Linux 2.4
      transfer sizes, and the alignment of user buffer and file offset
      must all be multiples of the logical block size of the file system.
      Under Linux 2.6 alignment to 512-byte boundaries suffices.

If I focus on the sentence "The I/O is synchronous, that is, at the completion of a read(2) or write(2), data is guaranteed to have been transferred.", I think there is a bug here. If I open a file with the O_DIRECT flag and the host reports back to me that the transfer has completed when in fact it's still in the host cache, it's a bug as it violates the open()/write() call and there is no guarantee that the data will actually be written.

So I guess the real issue isn't what the default should be (although the performance team at Red Hat would vote for cache=off), the real issue is that we need to honor the system call from the guest. If the file is opened with O_DIRECT on the guest, then the host needs to honor that and do the same.

-mark

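To make the alignment language in the quoted manpage concrete, here is a small self-contained example (not taken from qemu) of a well-formed O_DIRECT write on Linux 2.6: the buffer, the transfer length and the file offset all have to be 512-byte aligned, and posix_memalign() is the usual way to get such a buffer. It assumes the target file already exists.

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int write_odirect(const char *path, const void *data, size_t len)
{
    void *buf;
    int fd, ret = -1;

    if (len % 512 || posix_memalign(&buf, 512, len))
        return -1;                  /* length and buffer must be 512-aligned */

    memcpy(buf, data, len);
    fd = open(path, O_WRONLY | O_DIRECT);
    if (fd >= 0) {
        if (pwrite(fd, buf, len, 0) == (ssize_t)len)
            ret = 0;                /* "completed" != data on the platter */
        close(fd);
    }
    free(buf);
    return ret;
}
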
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU

From: Anthony Liguori @ 2008-10-11 20:35 UTC
To: qemu-devel
Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel

Mark Wagner wrote:
> Avi Kivity wrote:
>
> I think one of the main things to be considered is the integrity of the
> actual system call. The Linux manpage for open() states the following
> about the use of the O_DIRECT flag:
>
>   O_DIRECT (Since Linux 2.6.10)
>       Try to minimize cache effects of the I/O to and from this file. In
>       general this will degrade performance, but it is useful in special
>       situations, such as when applications do their own caching. File
>       I/O is done directly to/from user space buffers. The I/O is
>       synchronous, that is, at the completion of a read(2) or write(2),
>       data is guaranteed to have been transferred. Under Linux 2.4
>       transfer sizes, and the alignment of user buffer and file offset
>       must all be multiples of the logical block size of the file system.
>       Under Linux 2.6 alignment to 512-byte boundaries suffices.
>
> If I focus on the sentence "The I/O is synchronous, that is, at
> the completion of a read(2) or write(2), data is guaranteed to have
> been transferred.",

It's extremely important to understand what the guarantee is. The guarantee is that upon completion of write(), the data will have been reported as written by the underlying storage subsystem. This does *not* mean that the data is on disk.

If you have a normal laptop, your disk has a cache. That cache does not have a battery backup. Under normal operations, the cache is acting in write-back mode and when you do a write, the disk will report the write as completed even though it is not actually on disk. If you really care about the data being on disk, you have to either use a disk with a battery-backed cache (much more expensive) or enable write-through caching (will significantly reduce performance).

In the case of KVM, even using write-back caching with the host page cache, we are still honoring the guarantee of O_DIRECT. We just have another level of caching that happens to be write-back.

> I think there is a bug here. If I open a
> file with the O_DIRECT flag and the host reports back to me that
> the transfer has completed when in fact it's still in the host cache,
> it's a bug as it violates the open()/write() call and there is no
> guarantee that the data will actually be written.

This is very important: O_DIRECT does *not* guarantee that data actually resides on disk. There are many possible places that it can be cached (in the storage controller, in the disks themselves, in a RAID controller).

> So I guess the real issue isn't what the default should be (although
> the performance team at Red Hat would vote for cache=off), the real
> issue is that we need to honor the system call from the guest. If
> the file is opened with O_DIRECT on the guest, then the host needs
> to honor that and do the same.

The consensus so far has been that we want to still use the host page cache but use it in write-through mode. This would mean that the guest would only see data completion when the host's storage subsystem reports the write as having completed. This is not the same as cache=off but I think gives the real effect that is desired.

Do you have another argument for using cache=off?

Regards,

Anthony Liguori

* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-11 20:35 ` Anthony Liguori @ 2008-10-12 0:43 ` Mark Wagner 2008-10-12 1:50 ` Chris Wright 2008-10-12 17:54 ` Anthony Liguori 2008-10-12 0:44 ` Chris Wright 2008-10-12 10:12 ` Avi Kivity 2 siblings, 2 replies; 101+ messages in thread From: Mark Wagner @ 2008-10-12 0:43 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Anthony Liguori wrote: Note I think that are two distinct arguments going on here. My main concern is that I don't think that this a simple "what do we make the default cache policy be" issue. I think that regardless of the cache policy, if something in the guest requests O_DIRECT, the host must honor that and not cache the data. So in the following discussion below, the question of what the default cache flag should be and the question of the host needing to honor O_DIRECT in a guest are somewhat intermingled... > Mark Wagner wrote: >> Avi Kivity wrote: >> >> I think one of the main things to be considered is the integrity of the >> actual system call. The Linux manpage for open() states the following >> about the use of the O_DIRECT flag: >> >> O_DIRECT (Since Linux 2.6.10) >> Try to minimize cache effects of the I/O to and from this file. In >> general this will degrade performance, but it is useful in special >> situations, such as when applications do their own caching. File >> I/O is done directly to/from user space buffers. The I/O is >> synchronous, that is, at the completion of a read(2) or write(2), >> data is guaranteed to have been transferred. Under Linux 2.4 >> transfer sizes, and the alignment of user buffer and file offset >> must all be multiples of the logical block size of the file system. >> Under Linux 2.6 alignment to 512-byte boundaries suffices. >> >> >> If I focus on the sentence "The I/O is synchronous, that is, at >> the completion of a read(2) or write(2), data is guaranteed to have >> been transferred. ", > > It's extremely important to understand what the guarantee is. The > guarantee is that upon completion on write(), the data will have been > reported as written by the underlying storage subsystem. This does > *not* mean that the data is on disk. I apologize if I worded it poorly, I assume that the guarantee is that the data has been sent to the storage controller and said controller sent an indication that the write has completed. This could mean multiple things likes its in the controllers cache, on the disk, etc. I do not believe that this means that the data is still sitting in the host cache. I realize it may not yet be on a disk, but, at a minimum, I would expect that is has been sent to the storage controller. Do you consider the hosts cache to be part of the storage subsystem ? > > If you have a normal laptop, your disk has a cache. That cache does not > have a battery backup. Under normal operations, the cache is acting in > write-back mode and when you do a write, the disk will report the write > as completed even though it is not actually on disk. If you really care > about the data being on disk, you have to either use a disk with a > battery backed cache (much more expensive) or enable write-through > caching (will significantly reduce performance). > We are testing things on the big side. Systems with 32 GB of mem, 2 TB of enterprise storage (MSA, EVA, etc). There is a write cache with battery backup on the storage controllers. 
We understand the trade offs between the life-time of the battery and the potential data loss because they are well documented and we can make informed decisions because we know they are there. I think that people are too quickly assuming that because an IDE drive will cache your writes *if you let it*, then its clearly OK for the host to lie to the guests when they request O_DIRECT and cache whatever the developers feel like. I think the leap to get from the write cache on an IDE drive to "its OK to cache what ever we want on the host" is huge, and deadly. Keep in mind, the disk on a laptop is not caching GB worth of data like the host can. The impact is that while there is a chance of data loss with my laptop if I leave the disk cache on, the amount of data is much smaller and the time it takes to flush the disks cache is also much smaller than a multi-GB cache on my host. > In the case of KVM, even using write-back caching with the host page > cache, we are still honoring the guarantee of O_DIRECT. We just have > another level of caching that happens to be write-back. I still don't get it. If I have something running on the host that I open with O_DIRECT, do you still consider it not to be a violation of the system call if that data ends up in the host cache instead of being sent to the storage controller? If you do think it violates the terms of the call, then what is the difference between the host and a guest in this situation? QEMU is clearly not a battery backed storage controller. > >> I think there a bug here. If I open a >> file with the O_DIRECT flag and the host reports back to me that >> the transfer has completed when in fact its still in the host cache, >> its a bug as it violates the open()/write() call and there is no >> guarantee that the data will actually be written. > > This is very important, O_DIRECT does *not* guarantee that data actually > resides on disk. There are many possibly places that it can be cached > (in the storage controller, in the disks themselves, in a RAID controller). > I don't believe I said was on the disk, just that the host indicated to the guest that the write had completed. Everything you mentioned could be considered external to the OS. You didn't mention the host page cache, is it allowed there or not? >> So I guess the real issue isn't what the default should be (although >> the performance team at Red Hat would vote for cache=off), > > The consensus so far has been that we want to still use the host page > cache but use it in write-through mode. This would mean that the guest > would only see data completion when the host's storage subsystem reports > the write as having completed. This is not the same as cache=off but I > think gives the real effect that is desired. > > Do you have another argument for using cache=off? Thats not the argument I'm trying to make. Well I guess I still didn't make my point clearly. cache=off seems to be a band-aid to the fact that the host is not honoring the O_DIRECT flag. I can easily see a malicious use of the cache=on flag to inject something into the data stream or highjack said stream from a guest app that requested O_DIRECT. While this is also possible in may other ways, in this particular case it is enabled via the config option in QEMU. I can easily see something as simple as setting a large page cache, config the guests to use cache=on and then every second messing with the caches in order to cause data corruption. (wonder if "echo 1 > /proc/sys/vm/drop_caches will do the trick ?)". 
From the guest's perspective, it has been guaranteed that its data is secure, but it really isn't. We are testing with Oracle right now. Oracle assumes it has control of the storage and does lots of things assuming direct IO. However, I can configure cache=on for the storage presented to the guest and Oracle really won't have direct control because there is a host cache in the way. If I run the same Oracle config on bare metal, it does have direct control because the OS knows that the host cache must be bypassed. The end result is that the final behavior of the guest OS is drastically different from that of the same OS running on bare metal, because I can configure QEMU to hijack the data underneath the actual call and, at a minimum, delay it from going to the external storage subsystem where the application expects it to be. The impact of this decision is that it makes QEMU unreliable for any type of use that requires data integrity and unsuitable for any type of enterprise deployment. -mark > Regards, > > Anthony Liguori > >> the real >> issue is that we need to honor the system call from the guest. If >> the file is opened with O_DIRECT on the guest, then the host needs >> to honor that and do the same. >> >> -mark >> >> >> >> > > > ^ permalink raw reply [flat|nested] 101+ messages in thread
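To make the open(2) behaviour being argued about above concrete, here is a minimal host-side sketch, assuming Linux and a raw image file ("disk.img" is just a placeholder, and error handling is abbreviated). O_DIRECT bypasses the host page cache and needs aligned buffers; O_DSYNC keeps the cache for reads but does not report a write complete until the host's storage subsystem has acknowledged it.

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* cache=off style: bypass the host page cache entirely. */
        int fd_direct = open("disk.img", O_RDWR | O_DIRECT);

        /* Proposed default: keep the page cache for reads, but do not
         * report a write complete until the host storage subsystem has
         * acknowledged it (write-through). */
        int fd_dsync = open("disk.img", O_RDWR | O_DSYNC);

        if (fd_direct < 0 || fd_dsync < 0) {
            perror("open");
            return 1;
        }

        /* O_DIRECT requires the buffer, offset and length to be aligned;
         * 512-byte alignment suffices on Linux 2.6. */
        void *buf;
        if (posix_memalign(&buf, 512, 4096))
            return 1;
        memset(buf, 0, 4096);

        if (pwrite(fd_direct, buf, 4096, 0) != 4096)   /* no host caching */
            perror("pwrite O_DIRECT");
        if (pwrite(fd_dsync, buf, 4096, 0) != 4096)    /* cached, write-through */
            perror("pwrite O_DSYNC");

        free(buf);
        close(fd_direct);
        close(fd_dsync);
        return 0;
    }

Neither flag, on its own, says anything about the write cache inside the disk itself, which is the point Jamie Lokier raises further down the thread.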
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 0:43 ` Mark Wagner @ 2008-10-12 1:50 ` Chris Wright 2008-10-12 16:22 ` Jamie Lokier 2008-10-12 17:54 ` Anthony Liguori 1 sibling, 1 reply; 101+ messages in thread From: Chris Wright @ 2008-10-12 1:50 UTC (permalink / raw) To: Mark Wagner Cc: Chris Wright, Mark McLoughlin, kvm, Laurent Vivier, qemu-devel, Ryan Harper * Mark Wagner (mwagner@redhat.com) wrote: > I think that are two distinct arguments going on here. My main concern is > that I don't think that this a simple "what do we make the default cache policy > be" issue. I think that regardless of the cache policy, if something in the > guest requests O_DIRECT, the host must honor that and not cache the data. OK, O_DIRECT in the guest is just one example of the guest requesting data to be synchronously written to disk. It bypasses guest page cache, but even page cached writes need to be written at some point. Any time the disk driver issues an io where it expects the data to be on disk (possible low-level storage subystem caching) is the area of concern. * Mark Wagner (mwagner@redhat.com) wrote: > Anthony Liguori wrote: >> It's extremely important to understand what the guarantee is. The >> guarantee is that upon completion on write(), the data will have been >> reported as written by the underlying storage subsystem. This does >> *not* mean that the data is on disk. > > I apologize if I worded it poorly, I assume that the guarantee is that > the data has been sent to the storage controller and said controller > sent an indication that the write has completed. This could mean > multiple things likes its in the controllers cache, on the disk, etc. > > I do not believe that this means that the data is still sitting in the > host cache. I realize it may not yet be on a disk, but, at a minimum, > I would expect that is has been sent to the storage controller. Do you > consider the hosts cache to be part of the storage subsystem ? Either wt or uncached (so host O_DSYNC or O_DIRECT) would suffice to get it through to host's storage subsytem, and I think that's been the core of the discussion (plus defaults, etc). >> In the case of KVM, even using write-back caching with the host page >> cache, we are still honoring the guarantee of O_DIRECT. We just have >> another level of caching that happens to be write-back. > > I still don't get it. If I have something running on the host that I > open with O_DIRECT, do you still consider it not to be a violation of > the system call if that data ends up in the host cache instead of being > sent to the storage controller? I suppose an argument could be made for host caching and write-back to be considered part of the storage subsystem from the guest pov, but then we also need to bring in the requirement for proper cache flushing. Given a popular linux guest fs can be a little fast and loose, wb and flushing isn't really optimal choice for the integrity case. thanks, -chris ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 1:50 ` Chris Wright @ 2008-10-12 16:22 ` Jamie Lokier 0 siblings, 0 replies; 101+ messages in thread From: Jamie Lokier @ 2008-10-12 16:22 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, kvm, Laurent Vivier, Ryan Harper, Mark Wagner Chris Wright wrote: > Either wt or uncached (so host O_DSYNC or O_DIRECT) would suffice to get > it through to host's storage subsytem, and I think that's been the core > of the discussion (plus defaults, etc). Just want to point out that the storage commitment from O_DIRECT can be _weaker_ than O_DSYNC. On Linux, O_DIRECT never uses storage-device barriers or transactions, but O_DSYNC sometimes does, and fsync is even more likely to than O_DSYNC. I'm not certain, but I think the same applies to other host OSes too - including Windows, which has its own equivalents to O_DSYNC and O_DIRECT, and extra documented semantics when they are used together. Although this is a host implementation detail, unfortunately it means that O_DIRECT=no-cache and O_DSYNC=write-through-cache is not an accurate characterisation. Some might be misled into assuming that "cache=off" commits their data to hard storage as strongly as "cache=wb" would. I think you can assume this only when the underlying storage devices' write caches are disabled. You cannot assume this if the host filesystem uses barriers instead of disabling the storage devices' write cache. Unfortunately there's not a lot qemu can do about these various quirks, but at least it should be documented, so that someone requiring storage commitment (e.g. for a critical guest database) is advised to investigate whether O_DIRECT and/or O_DSYNC give them what they require with their combination of host kernel, filesystem, filesystem options and storage device(s). -- Jamie ^ permalink raw reply [flat|nested] 101+ messages in thread
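One practical reading of Jamie's caveat, as a sketch rather than a recipe (Linux assumed; whether the flush actually reaches the platter still depends on the filesystem, its barrier support and the drive's write-cache setting): if the stronger commitment is wanted, an explicit fdatasync() has to be issued even on a descriptor opened O_DIRECT.

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Write one block through O_DIRECT and then explicitly flush.
     * O_DIRECT only bypasses the host page cache; it does not by itself
     * issue a barrier or cache-flush to the storage device, so an
     * fdatasync() is still needed for the stronger commitment. */
    static int write_committed(int fd, const void *buf, size_t len, off_t off)
    {
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
            return -1;
        return fdatasync(fd);
    }

    int main(void)
    {
        int fd = open("disk.img", O_RDWR | O_DIRECT);
        void *sector;

        if (fd < 0 || posix_memalign(&sector, 512, 512))
            return 1;
        memset(sector, 0, 512);
        write_committed(fd, sector, 512, 0);
        free(sector);
        close(fd);
        return 0;
    }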
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 0:43 ` Mark Wagner 2008-10-12 1:50 ` Chris Wright @ 2008-10-12 17:54 ` Anthony Liguori 2008-10-12 18:14 ` nuitari-qemu 2008-10-13 0:27 ` Mark Wagner 1 sibling, 2 replies; 101+ messages in thread From: Anthony Liguori @ 2008-10-12 17:54 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Mark Wagner wrote: > > I do not believe that this means that the data is still sitting in the > host cache. I realize it may not yet be on a disk, but, at a minimum, > I would expect that is has been sent to the storage controller. Do you > consider the hosts cache to be part of the storage subsystem ? Yes. And the storage subsystem is often complicated like this. Consider if you had a hardware iSCSI initiator. The host just sees a SCSI disk and when the writes are issued as completed, that simply means the writes have gone to the iSCSI server. The iSCSI server may have its own cache or some deep storage multi-level cached storage subsystem. The fact that the virtualization layer has a cache is really not that unusual. Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 17:54 ` Anthony Liguori @ 2008-10-12 18:14 ` nuitari-qemu 2008-10-13 0:27 ` Mark Wagner 1 sibling, 0 replies; 101+ messages in thread From: nuitari-qemu @ 2008-10-12 18:14 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier >> I do not believe that this means that the data is still sitting in the >> host cache. I realize it may not yet be on a disk, but, at a minimum, >> I would expect that is has been sent to the storage controller. Do you >> consider the hosts cache to be part of the storage subsystem ? > > The fact that the virtualization layer has a cache is really not that > unusual. Wouldn't it be better to have cache=on/off control whether or not qemu/kvm does any caching of its own, and have a different configuration option for O_DIRECT / O_DSYNC on the disk files? ^ permalink raw reply [flat|nested] 101+ messages in thread
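nuitari's suggestion roughly matches the split the thread converges on: one knob that picks between write-through, write-back and no host caching. A sketch of how such a knob could map onto host open(2) flags (illustrative only, not QEMU's actual option handling):

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>

    enum cache_mode {
        CACHE_WRITETHROUGH,   /* host page cache, writes acknowledged by storage */
        CACHE_WRITEBACK,      /* host page cache, writes acknowledged from cache */
        CACHE_NONE            /* bypass the host page cache entirely */
    };

    /* Translate a user-visible cache mode into host open(2) flags. */
    static int image_open_flags(enum cache_mode mode)
    {
        int flags = O_RDWR;

        switch (mode) {
        case CACHE_WRITETHROUGH:
            flags |= O_DSYNC;     /* cached reads, write-through writes */
            break;
        case CACHE_NONE:
            flags |= O_DIRECT;    /* no host caching of reads or writes */
            break;
        case CACHE_WRITEBACK:
            break;                /* integrity then depends on the host */
        }
        return flags;
    }

    int main(void)
    {
        int fd = open("disk.img", image_open_flags(CACHE_WRITETHROUGH));
        return fd < 0;
    }

With that mapping, write-through keeps the read-side benefits of the host page cache while making write acknowledgements depend on the storage subsystem, which is the O_DSYNC default being proposed.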
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 17:54 ` Anthony Liguori 2008-10-12 18:14 ` nuitari-qemu @ 2008-10-13 0:27 ` Mark Wagner 2008-10-13 1:21 ` Anthony Liguori 1 sibling, 1 reply; 101+ messages in thread From: Mark Wagner @ 2008-10-13 0:27 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Anthony Liguori wrote: > Mark Wagner wrote: >> >> I do not believe that this means that the data is still sitting in the >> host cache. I realize it may not yet be on a disk, but, at a minimum, >> I would expect that is has been sent to the storage controller. Do you >> consider the hosts cache to be part of the storage subsystem ? > > Yes. And the storage subsystem is often complicated like this. > Consider if you had a hardware iSCSI initiator. The host just sees a > SCSI disk and when the writes are issued as completed, that simply means > the writes have gone to the iSCSI server. The iSCSI server may have its > own cache or some deep storage multi-level cached storage subsystem. > If you stopped and listened to yourself, you'd see that you are making my point... AFAIK, QEMU is neither designed nor intended to be an Enterprise Storage Array, I thought this group is designing a virtualization layer. However, the persistent argument is that since Enterprise Storage products will often acknowledge a write before the data is actually on the disk, its OK for QEMU to do the same. If QEMU had a similar design to Enterprise Storage with redundancy, battery backup, etc, I'd be fine with it, but you don't. QEMU is a layer that I've also thought was suppose to be small, lightweight and unobtrusive that is silently putting everyones data at risk. The low-end iSCSI server from EqualLogic claims: "it combines intelligence and automation with fault tolerance" "Dual, redundant controllers with a total of 4 GB battery-backed memory" AFAIK QEMU provides neither of these characteristics. -mark > The fact that the virtualization layer has a cache is really not that > unusual. Do other virtualization layers lie to the guest and indicate that the data has successfully been ACK'd by the storage subsystem when the data is actually still in the host cache? -mark > > Regards, > > Anthony Liguori > > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 0:27 ` Mark Wagner @ 2008-10-13 1:21 ` Anthony Liguori 2008-10-13 2:09 ` Mark Wagner 0 siblings, 1 reply; 101+ messages in thread From: Anthony Liguori @ 2008-10-13 1:21 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Mark Wagner wrote: > If you stopped and listened to yourself, you'd see that you are making > my point... > > AFAIK, QEMU is neither designed nor intended to be an Enterprise > Storage Array, > I thought this group is designing a virtualization layer. However, > the persistent > argument is that since Enterprise Storage products will often > acknowledge a write > before the data is actually on the disk, its OK for QEMU to do the same. I think you're a little lost in this thread. We're going to have QEMU only acknowledge writes when they complete. I've already sent out a patch. Just waiting a couple days to let everyone give their input. > If QEMU > had a similar design to Enterprise Storage with redundancy, battery > backup, etc, I'd > be fine with it, but you don't. QEMU is a layer that I've also thought > was suppose > to be small, lightweight and unobtrusive that is silently putting > everyones data > at risk. > > The low-end iSCSI server from EqualLogic claims: > "it combines intelligence and automation with fault tolerance" > "Dual, redundant controllers with a total of 4 GB battery-backed > memory" > > AFAIK QEMU provides neither of these characteristics. So if this is your only concern, we're in violent agreement. You were previously arguing that we should use O_DIRECT in the host if we're not "lying" about write completions anymore. That's what I'm opposing because the details of whether we use O_DIRECT or not have absolutely nothing to do with data integrity as long as we're using O_DSYNC. Regards, Anthony Liguori > > -mark > >> The fact that the virtualization layer has a cache is really not that >> unusual. > Do other virtualization layers lie to the guest and indicate that the > data > has successfully been ACK'd by the storage subsystem when the data is > actually > still in the host cache? > > > -mark >> >> Regards, >> >> Anthony Liguori >> >> > > > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 1:21 ` Anthony Liguori @ 2008-10-13 2:09 ` Mark Wagner 2008-10-13 3:16 ` Anthony Liguori ` (2 more replies) 0 siblings, 3 replies; 101+ messages in thread From: Mark Wagner @ 2008-10-13 2:09 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Anthony Liguori wrote: > Mark Wagner wrote: >> If you stopped and listened to yourself, you'd see that you are making >> my point... >> >> AFAIK, QEMU is neither designed nor intended to be an Enterprise >> Storage Array, >> I thought this group is designing a virtualization layer. However, >> the persistent >> argument is that since Enterprise Storage products will often >> acknowledge a write >> before the data is actually on the disk, its OK for QEMU to do the same. > > I think you're a little lost in this thread. We're going to have QEMU > only acknowledge writes when they complete. I've already sent out a > patch. Just waiting a couple days to let everyone give their input. > Actually, I'm just don't being clear enough in trying to point out that I don't think just setting a default value for "cache" goes far enough. My argument has nothing to do with the default value. It has to do with what the right thing to do is in specific situations regardless of the value of the cache setting. My point is that if a file is opened in the guest with the O_DIRECT (or O_DSYNC) then QEMU *must* honor that regardless of whatever value the current value of "cache" is. So, if the system admin for the host decides to set cache=on and something in the guest opens a file with O_DIRECT, I feel that it is a violation of the system call for the host to cache the write in its local cache w/o sending it immediately to the storage subsystem. It must get an ACK from the storage subsystem before it can return to the guest in order to preserve the guarantee. So, if your proposed default value for the cache is in effect, then O_DSYNC should provide the write-thru required by the guests use of O_DIRECT on the writes. However, if the default cache value is not used and its set to cache=on, and if the guest is using O_DIRECT or O_DSYNC, I feel there are issues that need to be addressed. -mark >> If QEMU >> had a similar design to Enterprise Storage with redundancy, battery >> backup, etc, I'd >> be fine with it, but you don't. QEMU is a layer that I've also thought >> was suppose >> to be small, lightweight and unobtrusive that is silently putting >> everyones data >> at risk. >> >> The low-end iSCSI server from EqualLogic claims: >> "it combines intelligence and automation with fault tolerance" >> "Dual, redundant controllers with a total of 4 GB battery-backed >> memory" >> >> AFAIK QEMU provides neither of these characteristics. > > So if this is your only concern, we're in violent agreement. You were > previously arguing that we should use O_DIRECT in the host if we're not > "lying" about write completions anymore. That's what I'm opposing > because the details of whether we use O_DIRECT or not have absolutely > nothing to do with data integrity as long as we're using O_DSYNC. > > Regards, > > Anthony Liguori > >> >> -mark >> >>> The fact that the virtualization layer has a cache is really not that >>> unusual. >> Do other virtualization layers lie to the guest and indicate that the >> data >> has successfully been ACK'd by the storage subsystem when the data is >> actually >> still in the host cache? 
>> >> >> -mark >>> >>> Regards, >>> >>> Anthony Liguori >>> >>> >> >> >> > > > ^ permalink raw reply [flat|nested] 101+ messages in thread
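The ordering Mark is asking for can be stated as a small sketch (the structure and helper names here are hypothetical, not QEMU code): if the image is opened O_DSYNC or O_DIRECT, the emulated controller completes a guest write only after the host write call has returned, i.e. only after the host's storage subsystem has acknowledged the data.

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical request structure for a guest write. */
    struct io_req {
        int    fd;        /* image file opened with O_DSYNC or O_DIRECT */
        void  *buf;
        size_t len;
        off_t  offset;
    };

    /* Stand-in for raising the virtual controller's completion interrupt. */
    static void complete_guest_request(struct io_req *req, int error)
    {
        (void)req;
        if (error)
            fprintf(stderr, "guest write failed\n");
    }

    static void handle_guest_write(struct io_req *req)
    {
        /* Because the image was opened O_DSYNC (or O_DIRECT), pwrite()
         * does not return until the host's storage subsystem has
         * acknowledged the data.  Completing the guest request only
         * after that point means the guest is never told a write is
         * done while it is still sitting in the host page cache. */
        ssize_t n = pwrite(req->fd, req->buf, req->len, req->offset);
        complete_guest_request(req, n == (ssize_t)req->len ? 0 : -1);
    }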
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 2:09 ` Mark Wagner @ 2008-10-13 3:16 ` Anthony Liguori 2008-10-13 6:42 ` Aurelien Jarno 2008-10-13 14:38 ` Steve Ofsthun 2 siblings, 0 replies; 101+ messages in thread From: Anthony Liguori @ 2008-10-13 3:16 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Mark Wagner wrote: > So, if your proposed default value for the cache is in effect, then > O_DSYNC > should provide the write-thru required by the guests use of O_DIRECT > on the > writes. However, if the default cache value is not used and its set to > cache=on, and if the guest is using O_DIRECT or O_DSYNC, I feel there are The option would be cache=writeback and the man pages have a pretty clear warning in it that it could lead to data loss. It's used for -snapshot and it's totally safe for that (and improves write performance for that case). It's also there because a number of people expressed a concern that they did not care about data integrity and wished to be able to get the performance boost. I don't see a harm in that since I think we'll now have adequate documentation. Regards, Anthony Liguori > > issues that need to be addressed. > > -mark > >>> If QEMU >>> had a similar design to Enterprise Storage with redundancy, battery >>> backup, etc, I'd >>> be fine with it, but you don't. QEMU is a layer that I've also >>> thought was suppose >>> to be small, lightweight and unobtrusive that is silently putting >>> everyones data >>> at risk. >>> >>> The low-end iSCSI server from EqualLogic claims: >>> "it combines intelligence and automation with fault tolerance" >>> "Dual, redundant controllers with a total of 4 GB battery-backed >>> memory" >>> >>> AFAIK QEMU provides neither of these characteristics. >> >> So if this is your only concern, we're in violent agreement. You >> were previously arguing that we should use O_DIRECT in the host if >> we're not "lying" about write completions anymore. That's what I'm >> opposing because the details of whether we use O_DIRECT or not have >> absolutely nothing to do with data integrity as long as we're using >> O_DSYNC. >> >> Regards, >> >> Anthony Liguori >> >>> >>> -mark >>> >>>> The fact that the virtualization layer has a cache is really not >>>> that unusual. >>> Do other virtualization layers lie to the guest and indicate that >>> the data >>> has successfully been ACK'd by the storage subsystem when the data >>> is actually >>> still in the host cache? >>> >>> >>> -mark >>>> >>>> Regards, >>>> >>>> Anthony Liguori >>>> >>>> >>> >>> >>> >> >> >> > > > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 2:09 ` Mark Wagner 2008-10-13 3:16 ` Anthony Liguori @ 2008-10-13 6:42 ` Aurelien Jarno 2008-10-13 14:38 ` Steve Ofsthun 2 siblings, 0 replies; 101+ messages in thread From: Aurelien Jarno @ 2008-10-13 6:42 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Mark Wagner a écrit : > Anthony Liguori wrote: >> Mark Wagner wrote: >>> If you stopped and listened to yourself, you'd see that you are making >>> my point... >>> >>> AFAIK, QEMU is neither designed nor intended to be an Enterprise >>> Storage Array, >>> I thought this group is designing a virtualization layer. However, >>> the persistent >>> argument is that since Enterprise Storage products will often >>> acknowledge a write >>> before the data is actually on the disk, its OK for QEMU to do the same. >> I think you're a little lost in this thread. We're going to have QEMU >> only acknowledge writes when they complete. I've already sent out a >> patch. Just waiting a couple days to let everyone give their input. >> > Actually, I'm just don't being clear enough in trying to point out that I > don't think just setting a default value for "cache" goes far enough. My > argument has nothing to do with the default value. It has to do with what the > right thing to do is in specific situations regardless of the value of the > cache setting. > > My point is that if a file is opened in the guest with the O_DIRECT (or O_DSYNC) > then QEMU *must* honor that regardless of whatever value the current value of > "cache" is. > > So, if the system admin for the host decides to set cache=on and something > in the guest opens a file with O_DIRECT, I feel that it is a violation > of the system call for the host to cache the write in its local cache w/o > sending it immediately to the storage subsystem. It must get an ACK from > the storage subsystem before it can return to the guest in order to preserve > the guarantee. > > So, if your proposed default value for the cache is in effect, then O_DSYNC > should provide the write-thru required by the guests use of O_DIRECT on the > writes. However, if the default cache value is not used and its set to > cache=on, and if the guest is using O_DIRECT or O_DSYNC, I feel there are > issues that need to be addressed. > Everybody agrees that we should support data integrity *by default*. But please admit that some persons have different needs than yours, and actually *want* to lie to the guest. We should propose such and option, with a *big warning*. -- .''`. Aurelien Jarno | GPG: 1024D/F1BCDB73 : :' : Debian developer | Electrical Engineer `. `' aurel32@debian.org | aurelien@aurel32.net `- people.debian.org/~aurel32 | www.aurel32.net ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 2:09 ` Mark Wagner 2008-10-13 3:16 ` Anthony Liguori 2008-10-13 6:42 ` Aurelien Jarno @ 2008-10-13 14:38 ` Steve Ofsthun 2 siblings, 0 replies; 101+ messages in thread From: Steve Ofsthun @ 2008-10-13 14:38 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Mark Wagner wrote: > Anthony Liguori wrote: >> Mark Wagner wrote: >>> If you stopped and listened to yourself, you'd see that you are >>> making my point... >>> >>> AFAIK, QEMU is neither designed nor intended to be an Enterprise >>> Storage Array, >>> I thought this group is designing a virtualization layer. However, >>> the persistent >>> argument is that since Enterprise Storage products will often >>> acknowledge a write >>> before the data is actually on the disk, its OK for QEMU to do the same. >> >> I think you're a little lost in this thread. We're going to have QEMU >> only acknowledge writes when they complete. I've already sent out a >> patch. Just waiting a couple days to let everyone give their input. >> > Actually, I'm just don't being clear enough in trying to point out that I > don't think just setting a default value for "cache" goes far enough. My > argument has nothing to do with the default value. It has to do with > what the > right thing to do is in specific situations regardless of the value of the > cache setting. > > My point is that if a file is opened in the guest with the O_DIRECT (or > O_DSYNC) > then QEMU *must* honor that regardless of whatever value the current > value of > "cache" is. I disagree here. QEMU's contract is not with any particular guest OS interface. QEMU's contract is with the faithfulness of the hardware emulation. The guest OS must perform appropriate actions that would guarantee the behavior advertised to any particular application. So your discussion should focus on what should QEMU do when asked to flush an I/O stream on a virtual device. While the specific actions QEMU might perform may be different based on caching mode, the end result should be host caching flushed to the underlying storage hierarchy. Note that this still doesn't guarantee the I/O is on the disk unless the storage is configured properly. QEMU shouldn't attempt to provide stronger guarantees than the host OS provides. Looking at a parallel in the real world. Most disk drives today ship with write caching enabled. Most OSes will accept this and allow delayed writes to the actual media. Is this completely safe? No. Is this accepted? Yes. Now, to become safe an application will perform extraordinary actions (various sync modes, etc) to guarantee the data is on the media. Yet even this can be circumvented by specific performance modes in the storage hierarchy. However, there are best practices to follow to avoid unexpected vulnerabilities. For certain application environments is to mandatory to disable writeback caching on the drives. Yet we wouldn't want to impose this constraint on all application environments. There are always tradeoffs. Now given that there are data safety issues to deal with, it is important to prevent a default behavior that recklessly endangers guest data. A customer will expect a single virtual machine to exhibit the same data safety as a single physical machine. However, running a group of virtual machines on a single host, the guest user will expect the same reliability as a group of physical machines. 
Note that the virtualization layer adds vulnerabilities (a host OS crash, for example) that reduce the reliability of the virtual machines relative to the physical machines they replace. So the default behavior of a virtualization stack may need to be more conservative than the corresponding physical stack it replaces. On the flip side though, the virtualization layer can exploit new opportunities for optimization. Imagine a single macro operation running within a virtual machine (backup, OS installation). Data integrity of the entire operation is important, not the individual I/Os. So by disabling all individual I/O synchronization semantics, I can get a backup or installation to run in half the time. This can be a key advantage for virtual deployments. We don't want to prevent this situation just to guarantee the integrity of half a backup, or half an install. Steve ^ permalink raw reply [flat|nested] 101+ messages in thread
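Steve's framing, that QEMU's contract is with the emulated hardware, suggests the flush half of that contract looks something like the sketch below (hypothetical hook name, not the actual IDE/SCSI emulation): when the guest driver issues a cache-flush command, the device model forwards it to the host as fdatasync() on the backing file, whatever cache= mode is in effect.

    #include <unistd.h>

    /* Hypothetical device-model hook: called when the guest driver issues
     * a FLUSH CACHE (or equivalent) command for the virtual disk backed
     * by image_fd. */
    static int handle_guest_flush(int image_fd)
    {
        /* Push anything the host may still be caching for this image down
         * to its storage subsystem before acknowledging the command.  With
         * write-back caching this is what makes guest-issued flushes and
         * barriers meaningful; with O_DSYNC or O_DIRECT writes it is close
         * to a no-op for data, though it may still flush metadata. */
        return fdatasync(image_fd);
    }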
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-11 20:35 ` Anthony Liguori 2008-10-12 0:43 ` Mark Wagner @ 2008-10-12 0:44 ` Chris Wright 2008-10-12 10:21 ` Avi Kivity 2008-10-12 10:12 ` Avi Kivity 2 siblings, 1 reply; 101+ messages in thread From: Chris Wright @ 2008-10-12 0:44 UTC (permalink / raw) To: Anthony Liguori Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper * Anthony Liguori (anthony@codemonkey.ws) wrote: > Mark Wagner wrote: >> So I guess the real issue isn't what the default should be (although >> the performance team at Red Hat would vote for cache=off), > > The consensus so far has been that we want to still use the host page > cache but use it in write-through mode. This would mean that the guest > would only see data completion when the host's storage subsystem reports > the write as having completed. This is not the same as cache=off but I > think gives the real effect that is desired. I think it's safe to say the perf folks are concerned w/ data integrity first, stable/reproducible results second, and raw performance third. So seeing data cached in host was simply not what they expected. I think write through is sufficient. However I think that uncached vs. wt will show up on the radar under reproducible results (need to tune based on cache size). And in most overcommit scenarios memory is typically more precious than cpu, it's unclear to me if the extra buffering is anything other than memory overhead. As long as it's configurable then it's comparable and benchmarking and best practices can dictate best choice. thanks, -chris ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 0:44 ` Chris Wright @ 2008-10-12 10:21 ` Avi Kivity 2008-10-12 14:37 ` Dor Laor 2008-10-12 17:59 ` Anthony Liguori 0 siblings, 2 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-12 10:21 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper Chris Wright wrote: > I think it's safe to say the perf folks are concerned w/ data integrity > first, stable/reproducible results second, and raw performance third. > > So seeing data cached in host was simply not what they expected. I think > write through is sufficient. However I think that uncached vs. wt will > show up on the radar under reproducible results (need to tune based on > cache size). And in most overcommit scenarios memory is typically more > precious than cpu, it's unclear to me if the extra buffering is anything > other than memory overhead. As long as it's configurable then it's > comparable and benchmarking and best practices can dictate best choice. > Getting good performance because we have a huge amount of free memory in the host is not a good benchmark. Under most circumstances, the free memory will be used either for more guests, or will be given to the existing guests, which can utilize it more efficiently than the host. I can see two cases where this is not true: - using older, 32-bit guests which cannot utilize all of the cache. I think Windows XP is limited to 512MB of cache, and usually doesn't utilize even that. So if you have an application running on 32-bit Windows (or on 32-bit Linux with pae disabled), and a huge host, you will see a significant boost from cache=writethrough. This is a case where performance can exceed native, simply because native cannot exploit all the resources of the host. - if cache requirements vary in time across the different guests, and if some smart ballooning is not in place, having free memory on the host means we utilize it for whichever guest has the greatest need, so overall performance improves. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 10:21 ` Avi Kivity @ 2008-10-12 14:37 ` Dor Laor 2008-10-12 15:35 ` Jamie Lokier 2008-10-12 18:02 ` Anthony Liguori 2008-10-12 17:59 ` Anthony Liguori 1 sibling, 2 replies; 101+ messages in thread From: Dor Laor @ 2008-10-12 14:37 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Avi Kivity wrote: > Chris Wright wrote: >> I think it's safe to say the perf folks are concerned w/ data integrity >> first, stable/reproducible results second, and raw performance third. >> >> So seeing data cached in host was simply not what they expected. I >> think >> write through is sufficient. However I think that uncached vs. wt will >> show up on the radar under reproducible results (need to tune based on >> cache size). And in most overcommit scenarios memory is typically more >> precious than cpu, it's unclear to me if the extra buffering is anything >> other than memory overhead. As long as it's configurable then it's >> comparable and benchmarking and best practices can dictate best choice. >> > > Getting good performance because we have a huge amount of free memory > in the host is not a good benchmark. Under most circumstances, the > free memory will be used either for more guests, or will be given to > the existing guests, which can utilize it more efficiently than the host. > > I can see two cases where this is not true: > > - using older, 32-bit guests which cannot utilize all of the cache. I > think Windows XP is limited to 512MB of cache, and usually doesn't > utilize even that. So if you have an application running on 32-bit > Windows (or on 32-bit Linux with pae disabled), and a huge host, you > will see a significant boost from cache=writethrough. This is a case > where performance can exceed native, simply because native cannot > exploit all the resources of the host. > > - if cache requirements vary in time across the different guests, and > if some smart ballooning is not in place, having free memory on the > host means we utilize it for whichever guest has the greatest need, so > overall performance improves. > > > Another justification for ODIRECT is that many production system will use the base images for their VMs. It's mainly true for desktop virtualization but probably for some server virtualization deployments. In these type of scenarios, we can have all of the base image chain opened as default with caching for read-only while the leaf images are open with cache=off. Since there is ongoing effort (both by IT and developers) to keep the base images as big as possible, it guarantees that this data is best suited for caching in the host while the private leaf images will be uncached. This way we provide good performance and caching for the shared parent images while also promising correctness. Actually this is what happens on mainline qemu with cache=off. Cheers, Dor ^ permalink raw reply [flat|nested] 101+ messages in thread
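Dor's split policy can be sketched as follows (illustrative only; in practice the decision would live in the backing-file handling and, as Andrea suggests later, possibly in the qcow2 metadata): shared, read-only base images stay cacheable on the host, while the private, writable leaf is opened O_DIRECT.

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdbool.h>

    /* Open one link of a backing-file chain.  Shared, read-only base
     * images are left cacheable (many guests read the same blocks), while
     * the private, writable leaf bypasses the host page cache so its
     * writes are never acknowledged from host memory. */
    static int open_image(const char *path, bool is_leaf)
    {
        if (is_leaf)
            return open(path, O_RDWR | O_DIRECT);
        return open(path, O_RDONLY);   /* cached, read-only parent */
    }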
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 14:37 ` Dor Laor @ 2008-10-12 15:35 ` Jamie Lokier 2008-10-12 18:00 ` Anthony Liguori 2008-10-12 18:02 ` Anthony Liguori 1 sibling, 1 reply; 101+ messages in thread From: Jamie Lokier @ 2008-10-12 15:35 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Dor Laor wrote: > Actually this is what happens on mainline qemu with cache=off. Have I understood right that cache=off on a qcow2 image only uses O_DIRECT for the leaf image, and the chain of base images don't use O_DIRECT? Sometimes on a memory constrained host, where the (collective) guest memory is nearly as big as the host memory, I'm not sure this is what I want. -- Jamie ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 15:35 ` Jamie Lokier @ 2008-10-12 18:00 ` Anthony Liguori 0 siblings, 0 replies; 101+ messages in thread From: Anthony Liguori @ 2008-10-12 18:00 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Jamie Lokier wrote: > Dor Laor wrote: > >> Actually this is what happens on mainline qemu with cache=off. >> > > Have I understood right that cache=off on a qcow2 image only uses > O_DIRECT for the leaf image, and the chain of base images don't use > O_DIRECT? > Yeah, that's a bug IMHO and in my patch to add O_DSYNC, I fix that. I think an argument for O_DIRECT in the leaf and write-back in the base is seriously flawed... Regards, Anthony Liguori > Sometimes on a memory constrained host, where the (collective) guest > memory is nearly as big as the host memory, I'm not sure this is what > I want. > > -- Jamie > > > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 14:37 ` Dor Laor 2008-10-12 15:35 ` Jamie Lokier @ 2008-10-12 18:02 ` Anthony Liguori 2008-10-15 10:17 ` Andrea Arcangeli 1 sibling, 1 reply; 101+ messages in thread From: Anthony Liguori @ 2008-10-12 18:02 UTC (permalink / raw) To: Dor Laor Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Dor Laor wrote: > Avi Kivity wrote: > > Since there is ongoing effort (both by IT and developers) to keep the > base images as big as possible, it guarantees that > this data is best suited for caching in the host while the private > leaf images will be uncached. A proper CAS solution is really such a better approach. qcow2 deduplification is an interesting concept, but such a hack :-) > This way we provide good performance and caching for the shared parent > images while also promising correctness. You get correctness by using O_DSYNC. cache=off should disable the use of the page cache everywhere. Regards, Anthony Liguori > Actually this is what happens on mainline qemu with cache=off. > > Cheers, > Dor > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 18:02 ` Anthony Liguori @ 2008-10-15 10:17 ` Andrea Arcangeli 0 siblings, 0 replies; 101+ messages in thread From: Andrea Arcangeli @ 2008-10-15 10:17 UTC (permalink / raw) To: Anthony Liguori Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper [-- Attachment #1: Type: text/plain, Size: 2164 bytes --] On Sun, Oct 12, 2008 at 01:02:57PM -0500, Anthony Liguori wrote: > You get correctness by using O_DSYNC. cache=off should disable the use of > the page cache everywhere. The parent shared image is generally readonly (assuming no cluster fs or shared database storage). So O_DSYNC on the parent will be a noop but it's ok if you like it as a default. By default having cache enabled on the parent makes sense to me (O_DSYNC doesn't disable the cache like O_DIRECT does, reads are cached). Because the qemu command line is qcow2 internals agnostic (you can't specify which parent/child image to use, that's left to qemu-img to set on the qcow2 metadata) I guess the O_DIRECT/O_DSYNC behavior on the parent image should also be left to qemu-img. Assuming there's any reserved bitflag left in the qcow2 metadata to use to specify those bits. I also attached the results of my o_direct measurements. O_DIRECT seems very optimal already after the fixes to qcow2 to avoid submitting aio_read/write only large as a qcow2 cluster size. I was initially fooled because I didn't reduce the ram on the host to the guest size + less than the min filesize of iozone, after that O_DIRECT wins. All tests were run with the emulated ide driver, which is the one that soldice is using right now with non-linux guest. The aio-thread patch can't make any difference with ide as verified here. I also tried to enlarge the max dma in the ide driver to 512k (it's limited to 128k) but I couldn't measure any benefit. 128k large DMA on host seems enough to reach platter speed. I also tried with dma disabled on the guest ide driver, and that destroys the O_DIRECT performance because then the commands are too small to reach platter speed. The host IDE driver needs something >=64k to reach platter speed. In short I think except for the boot-time O_DIRECT is a must and things like this are why MAP_SHARED isn't nearly as good as O_DIRECT for certain cases, as it won't waste any cpu in the VM pagetable manglings and msyncing. So the parent image is the only one where it makes sense to allow caching to speed up the boot time and application startup on the shared executables. [-- Attachment #2: iozone-cleo-trunk-dma.ods --] [-- Type: application/vnd.oasis.opendocument.spreadsheet, Size: 37205 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 10:21 ` Avi Kivity 2008-10-12 14:37 ` Dor Laor @ 2008-10-12 17:59 ` Anthony Liguori 2008-10-12 18:34 ` Avi Kivity 1 sibling, 1 reply; 101+ messages in thread From: Anthony Liguori @ 2008-10-12 17:59 UTC (permalink / raw) To: Avi Kivity Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Avi Kivity wrote: > Chris Wright wrote: >> I think it's safe to say the perf folks are concerned w/ data integrity >> first, stable/reproducible results second, and raw performance third. >> >> So seeing data cached in host was simply not what they expected. I >> think >> write through is sufficient. However I think that uncached vs. wt will >> show up on the radar under reproducible results (need to tune based on >> cache size). And in most overcommit scenarios memory is typically more >> precious than cpu, it's unclear to me if the extra buffering is anything >> other than memory overhead. As long as it's configurable then it's >> comparable and benchmarking and best practices can dictate best choice. >> > > Getting good performance because we have a huge amount of free memory > in the host is not a good benchmark. Under most circumstances, the > free memory will be used either for more guests, or will be given to > the existing guests, which can utilize it more efficiently than the host. There's two arguments for O_DIRECT. The first is that you can avoid bringing in data into CPU cache. This requires zero-copy in QEMU but ignoring that, the use of the page cache doesn't necessarily prevent us from achieving this. In the future, most systems will have a DMA offload engine. This is a pretty obvious thing to attempt to accelerate with such an engine which would prevent cache pollution. Another possibility is to directly map the host's page cache into the guest's memory space. The later is a bit tricky but is so much more interesting especially if you have a strong storage backend that is capable of deduplification (you get memory compaction for free). I also have my doubts that the amount of memory saved by using O_DIRECT will have a noticable impact on performance considering that guest memory and page cache memory are entirely reclaimable. An LRU should make the best decisions about whether memory is more valuable for the guests or for the host page cache. Regards, Anthony Liguori > I can see two cases where this is not true: > > - using older, 32-bit guests which cannot utilize all of the cache. I > think Windows XP is limited to 512MB of cache, and usually doesn't > utilize even that. So if you have an application running on 32-bit > Windows (or on 32-bit Linux with pae disabled), and a huge host, you > will see a significant boost from cache=writethrough. This is a case > where performance can exceed native, simply because native cannot > exploit all the resources of the host. > > - if cache requirements vary in time across the different guests, and > if some smart ballooning is not in place, having free memory on the > host means we utilize it for whichever guest has the greatest need, so > overall performance improves. > > > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 17:59 ` Anthony Liguori @ 2008-10-12 18:34 ` Avi Kivity 2008-10-12 19:33 ` Izik Eidus 2008-10-12 19:59 ` Anthony Liguori 0 siblings, 2 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-12 18:34 UTC (permalink / raw) To: Anthony Liguori Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Anthony Liguori wrote: >> >> Getting good performance because we have a huge amount of free memory >> in the host is not a good benchmark. Under most circumstances, the >> free memory will be used either for more guests, or will be given to >> the existing guests, which can utilize it more efficiently than the >> host. > > There's two arguments for O_DIRECT. The first is that you can avoid > bringing in data into CPU cache. This requires zero-copy in QEMU but > ignoring that, the use of the page cache doesn't necessarily prevent > us from achieving this. > > In the future, most systems will have a DMA offload engine. This is a > pretty obvious thing to attempt to accelerate with such an engine > which would prevent cache pollution. But would increase latency, memory bus utilization, and cpu overhead. In the cases where the page cache buys us something (host page cache significantly larger than guest size), that's understandable. But for the other cases, why bother? Especially when many systems don't have this today. Let me phrase this another way: is there an argument against O_DIRECT? In a significant fraction of deployments it will be both simpler and faster. > Another possibility is to directly map the host's page cache into the > guest's memory space. > Doesn't work with large pages. > The later is a bit tricky but is so much more interesting especially > if you have a strong storage backend that is capable of > deduplification (you get memory compaction for free). > It's not free at all. Replacing a guest memory page involves IPIs and TLB flushes. It only works on small pages, and if the host page cache and guest page cache are aligned with each other. And with current Linux memory management, I don't see a way to do it that doesn't involve creating a vma for every page, which is prohibitively expensive. > I also have my doubts that the amount of memory saved by using > O_DIRECT will have a noticable impact on performance considering that > guest memory and page cache memory are entirely reclaimable. O_DIRECT is not about saving memory, it is about saving cpu utilization, cache utilization, and memory bandwidth. > An LRU should make the best decisions about whether memory is more > valuable for the guests or for the host page cache. > LRU typically makes fairly bad decisions since it throws most of the information it has away. I recommend looking up LRU-K and similar algorithms, just to get a feel for this; it is basically the simplest possible algorithm short of random selection. Note that Linux doesn't even have an LRU; it has to approximate since it can't sample all of the pages all of the time. With a hypervisor that uses Intel's EPT, it's even worse since we don't have an accessed bit. On silly benchmarks that just exercise the disk and touch no memory, and if you tune the host very aggresively, LRU will win on long running guests since it will eventually page out all unused guest memory (with Linux guests, it will never even page guest memory in). On real life applications I don't think there is much chance. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. 
^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 18:34 ` Avi Kivity @ 2008-10-12 19:33 ` Izik Eidus 2008-10-14 17:08 ` Avi Kivity 1 sibling, 1 reply; 101+ messages in thread From: Izik Eidus @ 2008-10-12 19:33 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper Avi Kivity wrote: > > LRU typically makes fairly bad decisions since it throws most of the > information it has away. I recommend looking up LRU-K and similar > algorithms, just to get a feel for this; it is basically the simplest > possible algorithm short of random selection. > > Note that Linux doesn't even have an LRU; it has to approximate since it > can't sample all of the pages all of the time. With a hypervisor that > uses Intel's EPT, it's even worse since we don't have an accessed bit. > On silly benchmarks that just exercise the disk and touch no memory, and > if you tune the host very aggresively, LRU will win on long running > guests since it will eventually page out all unused guest memory (with > Linux guests, it will never even page guest memory in). On real life > applications I don't think there is much chance. > > But when using O_DIRECT you actually make the pages not swappable at all... or am I wrong? Maybe some kind of combination with the mm shrinker could be good; do_try_to_free_pages is a good point of reference. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 19:33 ` Izik Eidus @ 2008-10-14 17:08 ` Avi Kivity 0 siblings, 0 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-14 17:08 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Izik Eidus wrote: > But when using O_DIRECT you actuality make the pages not swappable at > all... > or am i wrong? Only for the duration of the I/O operation, which is typically very short. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 18:34 ` Avi Kivity 2008-10-12 19:33 ` Izik Eidus @ 2008-10-12 19:59 ` Anthony Liguori 2008-10-12 20:43 ` Avi Kivity 1 sibling, 1 reply; 101+ messages in thread From: Anthony Liguori @ 2008-10-12 19:59 UTC (permalink / raw) To: Avi Kivity Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Avi Kivity wrote: > But would increase latency, memory bus utilization, and cpu overhead. > > In the cases where the page cache buys us something (host page cache > significantly larger than guest size), that's understandable. But for > the other cases, why bother? Especially when many systems don't have > this today. > > Let me phrase this another way: is there an argument against O_DIRECT? > It slows down any user who frequently restarts virtual machines. It reduces total system throughput when there are multiple virtual machines sharing a single disk. This latter point is my primary concern because in the future, I expect disk sharing to be common in some form (either via common QCOW base images or via CAS). I'd like to see a benchmark demonstrating that O_DIRECT improves overall system throughput in any scenario today. I just don't buy that the cost of the extra copy today is going to be significant since the CPU cache is already polluted. I think the burden of proof is on O_DIRECT because it's quite simple to demonstrate where it hurts performance (just the time it takes to do two boots of the same image). > In a significant fraction of deployments it will be both simpler and faster. > I think this is speculative. Is there any performance data to back this up? Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 19:59 ` Anthony Liguori @ 2008-10-12 20:43 ` Avi Kivity 2008-10-12 21:11 ` Anthony Liguori 0 siblings, 1 reply; 101+ messages in thread From: Avi Kivity @ 2008-10-12 20:43 UTC (permalink / raw) To: Anthony Liguori Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Anthony Liguori wrote: >> >> Let me phrase this another way: is there an argument against O_DIRECT? > > It slows down any user who frequently restarts virtual machines. This is an important use case (us developers), but not the majority of deployments. > It slows down total system throughput when there are multiple virtual > machines sharing a single disk. This later point is my primary > concern because in the future, I expect disk sharing to be common in > some form (either via common QCOW base images or via CAS). Sharing via qcow base images is also an important use case, but for desktop workloads. Server workloads will be able to share a lot less, and in any case will not keep reloading their text pages as desktops do. Regarding CAS, the Linux page cache indexes pages by inode number and offset, so it cannot share page cache contents without significant rework. Perhaps ksm could be adapted to do this, but it can't right now. And again, server consolidation scenarios which are mostly unrelated workloads jammed on a single host won't benefit much from this. > > I'd like to see a benchmark demonstrating that O_DIRECT improves > overall system throughput in any scenario today. I just don't buy the > cost of the extra copy today is going to be significant since the CPU > cache is already polluted. I think the burden of proof is on O_DIRECT > because it's quite simple to demonstrate where it hurts performance > (just the time it takes to do two boots of the same image). > >> In a significant fraction of deployments it will be both simpler and >> faster. >> > > I think this is speculative. Is there any performance data to back > this up? Given that we don't have a zero-copy implementation yet, it is impossible to generate real performance data. However it is backed up by experience; all major databases use direct I/O and their own caching; and since the data patterns of filesystems are similar to that of databases (perhaps less random), there's a case for not caching them. I'll repeat my arguments: - cache size In many deployments we will maximize the number of guests, so host memory will be low. If your L3 cache is smaller than your L2 cache, your cache hit rate will be low. Guests will write out data they are not expecting to need soon (the tails of their LRU, or their journals) so caching it is pointless. Conversely, they _will_ cache data they have just read. - cpu cache utilization When a guest writes out its page cache, this is likely to be some time after the cpu moved the data there. So it's out of the page cache. Now we're bringing it back to the cache, twice (once reading guest memory, second time writing to host page cache). Similarly, when reading from the host page cache into the guest, we have no idea whether the guest will actually touch the memory in question. It may be doing a readahead, or reading a metadata page of which it will only access a small part. So again we're wasting two pages worth of cache per page we're reading. Note also that we have no idea which vcpu will use the page, so even if the guest will touch the data, there is a high likelihood (for large guests) that it will be in the wrong cache. 
- conflicting readahead heuristics The host may attempt to perform readahead on the disk. However the guest is also doing readahead, so the host is extending the readahead further than is likely to be a good idea. Also, the guest does logical (file-based) readahead while the host does physical (disk order based) readahead, or qcow-level readahead which is basically reading random blocks. Now I don't have data that demonstrates how bad these effects are, but I think there is sufficient arguments here to justify adding O_DIRECT. I intend to recommend O_DIRECT unless I see performance data that favours O_DSYNC on real world scenarios that take into account bandwidth, cpu utilization, and memory utilization (i.e. a 1G guest on a 32G host running fio but not top doesn't count). -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
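On the readahead point specifically: when the host page cache is kept (the write-through default), the double-readahead effect can at least be damped with posix_fadvise(), assuming the host filesystem honours the hint; a sketch:

    #include <fcntl.h>

    /* Ask the host not to second-guess the guest's own readahead on this
     * image: through qcow2 especially, the access pattern the host sees
     * is close to random, so host-side readahead mostly pollutes the
     * cache. */
    static int tune_host_readahead(int image_fd)
    {
        return posix_fadvise(image_fd, 0, 0, POSIX_FADV_RANDOM);
    }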
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 20:43 ` Avi Kivity @ 2008-10-12 21:11 ` Anthony Liguori 2008-10-14 15:21 ` Avi Kivity 0 siblings, 1 reply; 101+ messages in thread From: Anthony Liguori @ 2008-10-12 21:11 UTC (permalink / raw) To: Avi Kivity Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Avi Kivity wrote: > Given that we don't have a zero-copy implementation yet, it is > impossible to generate real performance data. Which means that it's premature to suggest switching the default to O_DIRECT since it's not going to help right now. It can be revisited once we can do zero copy but again, I think it should be driven by actual performance data. My main point is that switching to O_DIRECT right now is only going to hurt performance for some users, and most likely help no one. > Now I don't have data that demonstrates how bad these effects are, but I > think there is sufficient arguments here to justify adding O_DIRECT. I > intend to recommend O_DIRECT unless I see performance data that favours > O_DSYNC on real world scenarios that take into account bandwidth, cpu > utilization, and memory utilization (i.e. a 1G guest on a 32G host > running fio but not top doesn't count). > So you intend on recommending something that you don't think is going to improve performance today and you know in certain scenarios is going to decrease performance? That doesn't seem right :-) I'm certainly open to changing the default once we get to a point where there's a demonstrable performance improvement from O_DIRECT but since I don't think it's a given that there will be, switching now seems like a premature optimization which has the side effect of reducing the performance of certain users. That seems like a Bad Thing to me. Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 21:11 ` Anthony Liguori @ 2008-10-14 15:21 ` Avi Kivity 2008-10-14 15:32 ` Anthony Liguori 2008-10-14 19:25 ` Laurent Vivier 0 siblings, 2 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-14 15:21 UTC (permalink / raw) To: Anthony Liguori Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Anthony Liguori wrote: > Avi Kivity wrote: >> Given that we don't have a zero-copy implementation yet, it is >> impossible to generate real performance data. > > Which means that it's premature to suggest switching the default to > O_DIRECT since it's not going to help right now. It can be revisited > once we can do zero copy but again, I think it should be driven by > actual performance data. My main point is that switching to O_DIRECT > right now is only going to hurt performance for some users, and most > likely help no one. I am assuming that we will provide true O_DIRECT support soon. I don't think O_DIRECT should be qemu's default, since anyone using qemu directly is likely a "causal virtualization" user. Management systems like ovirt should definitely default to O_DIRECT (really, they shouldn't even offer caching). >> Now I don't have data that demonstrates how bad these effects are, but I >> think there is sufficient arguments here to justify adding O_DIRECT. I >> intend to recommend O_DIRECT unless I see performance data that favours >> O_DSYNC on real world scenarios that take into account bandwidth, cpu >> utilization, and memory utilization (i.e. a 1G guest on a 32G host >> running fio but not top doesn't count). >> > > So you intend on recommending something that you don't think is going > to improve performance today and you know in certain scenarios is > going to decrease performance? That doesn't seem right :-) > In the near term O_DIRECT will increase performance over the alternative. > I'm certainly open to changing the default once we get to a point > where there's a demonstrable performance improvement from O_DIRECT but > since I don't think it's a given that there will be, switching now > seems like a premature optimization which has the side effect of > reducing the performance of certain users. That seems like a Bad > Thing to me. I take the opposite view. O_DIRECT is the, well, direct path to the hardware. Caching introduces an additional layer of code and thus needs to proven effective. I/O and memory intensive applications use O_DIRECT; Xen uses O_DIRECT (or equivalent); I don't see why we need to deviate from industry practice. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-14 15:21 ` Avi Kivity @ 2008-10-14 15:32 ` Anthony Liguori 2008-10-14 15:43 ` Avi Kivity 2008-10-14 19:25 ` Laurent Vivier 1 sibling, 1 reply; 101+ messages in thread From: Anthony Liguori @ 2008-10-14 15:32 UTC (permalink / raw) To: Avi Kivity Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Avi Kivity wrote: > I don't think O_DIRECT should be qemu's default, since anyone using qemu > directly is likely a "casual virtualization" user. Management systems > like ovirt should definitely default to O_DIRECT (really, they shouldn't > even offer caching). > ovirt isn't a good example because the default storage model is iSCSI. Since you aren't preserving zero-copy, I doubt that you'll see any advantage to using O_DIRECT (I suspect the code paths aren't even different). Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-14 15:32 ` Anthony Liguori @ 2008-10-14 15:43 ` Avi Kivity 0 siblings, 0 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-14 15:43 UTC (permalink / raw) To: Anthony Liguori Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Anthony Liguori wrote: > Avi Kivity wrote: >> I don't think O_DIRECT should be qemu's default, since anyone using qemu >> directly is likely a "casual virtualization" user. Management systems >> like ovirt should definitely default to O_DIRECT (really, they shouldn't >> even offer caching). >> > > ovirt isn't a good example because the default storage model is > iSCSI. Since you aren't preserving zero-copy, I doubt that you'll see > any advantage to using O_DIRECT (I suspect the code paths aren't even > different). If you have a hardware iSCSI initiator then O_DIRECT pays off. Even for a software initiator, the write path could be made zero copy. The read path doesn't look good though. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-14 15:21 ` Avi Kivity 2008-10-14 15:32 ` Anthony Liguori @ 2008-10-14 19:25 ` Laurent Vivier 2008-10-16 9:47 ` Avi Kivity 1 sibling, 1 reply; 101+ messages in thread From: Laurent Vivier @ 2008-10-14 19:25 UTC (permalink / raw) To: Avi Kivity Cc: Chris Wright, Mark McLoughlin, kvm-devel, qemu-devel, Ryan Harper Le mardi 14 octobre 2008 à 17:21 +0200, Avi Kivity a écrit : > Anthony Liguori wrote: > > Avi Kivity wrote: > >> Given that we don't have a zero-copy implementation yet, it is > >> impossible to generate real performance data. > > > > Which means that it's premature to suggest switching the default to > > O_DIRECT since it's not going to help right now. It can be revisited > > once we can do zero copy but again, I think it should be driven by > > actual performance data. My main point is that switching to O_DIRECT > > right now is only going to hurt performance for some users, and most > > likely help no one. > > I am assuming that we will provide true O_DIRECT support soon. If you remember, I tried to introduce zero copy when I wrote the "cache=off" patch: http://thread.gmane.org/gmane.comp.emulators.qemu/22148/focus=22149 but it was not correct (see Fabrice comment). Laurent -- ------------------ Laurent.Vivier@bull.net ------------------ "Tout ce qui est impossible reste à accomplir" Jules Verne "Things are only impossible until they're not" Jean-Luc Picard ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-14 19:25 ` Laurent Vivier @ 2008-10-16 9:47 ` Avi Kivity 0 siblings, 0 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-16 9:47 UTC (permalink / raw) To: Laurent Vivier Cc: Chris Wright, Mark McLoughlin, kvm-devel, qemu-devel, Ryan Harper Laurent Vivier wrote: >> I am assuming that we will provide true O_DIRECT support soon. >> > > If you remember, I tried to introduce zero copy when I wrote the > "cache=off" patch: > > http://thread.gmane.org/gmane.comp.emulators.qemu/22148/focus=22149 > > but it was not correct (see Fabrice comment). > Yes, this is not trivial, especially if we want to provide good support for all qemu targets. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-11 20:35 ` Anthony Liguori 2008-10-12 0:43 ` Mark Wagner 2008-10-12 0:44 ` Chris Wright @ 2008-10-12 10:12 ` Avi Kivity 2008-10-17 13:20 ` Jens Axboe 2 siblings, 1 reply; 101+ messages in thread From: Avi Kivity @ 2008-10-12 10:12 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Anthony Liguori wrote: >> >> If I focus on the sentence "The I/O is synchronous, that is, at >> the completion of a read(2) or write(2), data is guaranteed to have >> been transferred. ", > > It's extremely important to understand what the guarantee is. The > guarantee is that upon completion on write(), the data will have been > reported as written by the underlying storage subsystem. This does > *not* mean that the data is on disk. > It means that as far as the block-io layer of the kernel is concerned, the guarantee is met. If the writes go to to a ramdisk, or to an IDE drive with write-back cache enabled, or to disk with write-back cache disabled but without redundancy, or to a high-end storage array with double-parity protection but without a continuous data protection offsite solution, things may still go wrong. It is up to qemu to provide a strong link in the data reliability chain, not to ensure that the entire chain is perfect. That's up to the administrator or builder of the system. > If you have a normal laptop, your disk has a cache. That cache does > not have a battery backup. Under normal operations, the cache is > acting in write-back mode and when you do a write, the disk will > report the write as completed even though it is not actually on disk. > If you really care about the data being on disk, you have to either > use a disk with a battery backed cache (much more expensive) or enable > write-through caching (will significantly reduce performance). > I think that with SATA NCQ, this is no longer true. The drive will report the write complete when it is on disk, and utilize multiple outstanding requests to get coalescing and reordering. Not sure about this, though -- some drives may still be lying. > In the case of KVM, even using write-back caching with the host page > cache, we are still honoring the guarantee of O_DIRECT. We just have > another level of caching that happens to be write-back. No, we are lying. That's fine if the user tells us to lie, but not otherwise. >> I think there a bug here. If I open a >> file with the O_DIRECT flag and the host reports back to me that >> the transfer has completed when in fact its still in the host cache, >> its a bug as it violates the open()/write() call and there is no >> guarantee that the data will actually be written. > > This is very important, O_DIRECT does *not* guarantee that data > actually resides on disk. There are many possibly places that it can > be cached (in the storage controller, in the disks themselves, in a > RAID controller). O_DIRECT guarantees that the kernel is not the weak link in the reliability chain. > >> So I guess the real issue isn't what the default should be (although >> the performance team at Red Hat would vote for cache=off), > > The consensus so far has been that we want to still use the host page > cache but use it in write-through mode. This would mean that the > guest would only see data completion when the host's storage subsystem > reports the write as having completed. This is not the same as > cache=off but I think gives the real effect that is desired. 
I am fine with write-through as default, but cache=off should be a supported option. > > Do you have another argument for using cache=off? Performance. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 101+ messages in thread
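To make the distinction above concrete, here is a minimal userspace sketch of an O_DIRECT write on Linux: the host page cache is bypassed (hence the aligned buffer), but completion still only means the device accepted the write, so an explicit fdatasync() is the portable way to ask the layers below to flush (how effective that is depends on the filesystem and on barrier support, as discussed later in the thread). The 4096-byte alignment and the file path are assumptions for illustration.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    int fd = open(argv[1], O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    /* O_DIRECT wants buffer, offset and length aligned to the logical
       block size; 4096 is assumed here for illustration. */
    if (posix_memalign(&buf, 4096, 4096)) return 1;
    memset(buf, 0xab, 4096);

    if (pwrite(fd, buf, 4096, 0) != 4096) { perror("pwrite"); return 1; }

    /* Returning from pwrite() does not mean data is on the platter if the
       drive cache is write-back; fdatasync() asks the kernel to flush. */
    if (fdatasync(fd) < 0) { perror("fdatasync"); return 1; }

    free(buf);
    close(fd);
    return 0;
}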
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-12 10:12 ` Avi Kivity @ 2008-10-17 13:20 ` Jens Axboe 2008-10-19 9:01 ` Avi Kivity 0 siblings, 1 reply; 101+ messages in thread From: Jens Axboe @ 2008-10-17 13:20 UTC (permalink / raw) To: Avi Kivity Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper On Sun, Oct 12 2008, Avi Kivity wrote: > >If you have a normal laptop, your disk has a cache. That cache does > >not have a battery backup. Under normal operations, the cache is > >acting in write-back mode and when you do a write, the disk will > >report the write as completed even though it is not actually on disk. > >If you really care about the data being on disk, you have to either > >use a disk with a battery backed cache (much more expensive) or enable > >write-through caching (will significantly reduce performance). > > > > I think that with SATA NCQ, this is no longer true. The drive will > report the write complete when it is on disk, and utilize multiple > outstanding requests to get coalescing and reordering. Not sure about It is still very true. Go buy any consumer drive on the market and check the write cache settings - hint, it's definitely shipped with write back caching. So while the drive may have NCQ and Linux will use it, the write cache is still using write back unless you explicitly change it. > this, though -- some drives may still be lying. I think this is largely an urban myth, at least I've never come across any drives that lie. It's easy enough to test, modulo firmware bugs. Just switch to write through and compare the random write iops rate. Or enable write barriers in Linux and do the same workload, compare iops rate again with write back caching. -- Jens Axboe ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-17 13:20 ` Jens Axboe @ 2008-10-19 9:01 ` Avi Kivity 2008-10-19 18:10 ` Jens Axboe 0 siblings, 1 reply; 101+ messages in thread From: Avi Kivity @ 2008-10-19 9:01 UTC (permalink / raw) To: Jens Axboe Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Jens Axboe wrote: > On Sun, Oct 12 2008, Avi Kivity wrote: > >>> If you have a normal laptop, your disk has a cache. That cache does >>> not have a battery backup. Under normal operations, the cache is >>> acting in write-back mode and when you do a write, the disk will >>> report the write as completed even though it is not actually on disk. >>> If you really care about the data being on disk, you have to either >>> use a disk with a battery backed cache (much more expensive) or enable >>> write-through caching (will significantly reduce performance). >>> >>> >> I think that with SATA NCQ, this is no longer true. The drive will >> report the write complete when it is on disk, and utilize multiple >> outstanding requests to get coalescing and reordering. Not sure about >> > > It is still very true. Go buy any consumer drive on the market and check > the write cache settings - hint, it's definitely shipped with write back > caching. So while the drive may have NCQ and Linux will use it, the > write cache is still using write back unless you explicitly change it. > > Sounds like a bug. Shouldn't Linux disable the write cache unless the user explicitly enables it, if NCQ is available? NCQ should provide acceptable throughput even without the write cache. -- error compiling committee.c: too many arguments to function ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 9:01 ` Avi Kivity @ 2008-10-19 18:10 ` Jens Axboe 2008-10-19 18:23 ` Avi Kivity 2008-10-19 18:24 ` Avi Kivity 0 siblings, 2 replies; 101+ messages in thread From: Jens Axboe @ 2008-10-19 18:10 UTC (permalink / raw) To: Avi Kivity Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper On Sun, Oct 19 2008, Avi Kivity wrote: > Jens Axboe wrote: > >On Sun, Oct 12 2008, Avi Kivity wrote: > > > >>>If you have a normal laptop, your disk has a cache. That cache does > >>>not have a battery backup. Under normal operations, the cache is > >>>acting in write-back mode and when you do a write, the disk will > >>>report the write as completed even though it is not actually on disk. > >>>If you really care about the data being on disk, you have to either > >>>use a disk with a battery backed cache (much more expensive) or enable > >>>write-through caching (will significantly reduce performance). > >>> > >>> > >>I think that with SATA NCQ, this is no longer true. The drive will > >>report the write complete when it is on disk, and utilize multiple > >>outstanding requests to get coalescing and reordering. Not sure about > >> > > > >It is still very true. Go buy any consumer drive on the market and check > >the write cache settings - hint, it's definitely shipped with write back > >caching. So while the drive may have NCQ and Linux will use it, the > >write cache is still using write back unless you explicitly change it. > > > > > > Sounds like a bug. Shouldn't Linux disable the write cache unless the > user explicitly enables it, if NCQ is available? NCQ should provide > acceptable throughput even without the write cache. How can it be a bug? Changing the cache policy of a drive would be a policy decision in the kernel, that is never the right thing to do. There's no such thing as 'acceptable throughput', manufacturers and customers usually just want the go faster stripes and data consistency is second. Additionally, write back caching is perfectly safe, if used with a barrier enabled file system in Linux. Also note that most users will not have deep queuing for most things. To get good random write performance with write through caching and NCQ, you naturally need to be able to fill the drive queue most of the time. Most desktop workloads don't come close to that, so the user will definitely see it as slower. -- Jens Axboe ^ permalink raw reply [flat|nested] 101+ messages in thread
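As a concrete illustration of "a barrier enabled file system": ext3 accepts a barrier=1 mount option so that journal commits order around and flush the drive's write-back cache. A minimal sketch using mount(2) follows; the device, mount point and option string are examples only, and other filesystems spell the option differently.

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Example device and mount point; adjust for the local system. */
    if (mount("/dev/sda2", "/mnt/data", "ext3", 0, "data=ordered,barrier=1") < 0) {
        perror("mount");
        return 1;
    }
    printf("mounted with journal write barriers enabled\n");
    return 0;
}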
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 18:10 ` Jens Axboe @ 2008-10-19 18:23 ` Avi Kivity 2008-10-19 19:17 ` M. Warner Losh 2008-10-19 18:24 ` Avi Kivity 1 sibling, 1 reply; 101+ messages in thread From: Avi Kivity @ 2008-10-19 18:23 UTC (permalink / raw) To: Jens Axboe Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Jens Axboe wrote: >> Sounds like a bug. Shouldn't Linux disable the write cache unless the >> user explicitly enables it, if NCQ is available? NCQ should provide >> acceptable throughput even without the write cache. >> > > How can it be a bug? If it puts my data at risk, it's a bug. I can understand it for IDE, but not for SATA with NCQ. > Changing the cache policy of a drive would be a > policy decision in the kernel, If you don't want this in the kernel, then the system as a whole should default to being safe. Though in this case I think it is worthwhile to do this in the kernel. > that is never the right thing to do. > There's no such thing as 'acceptable throughput', I meant that performance is not completely destroyed. How can you even compare data safety to some percent of performance? > manufacturers and > customers usually just want the go faster stripes and data consistency > is second. What is the performance impact of disabling the write cache, given enough queue depth? > Additionally, write back caching is perfectly safe, if used > with a barrier enabled file system in Linux. > Not all Linux filesystems are barrier enabled, AFAIK. Further, barriers don't help with O_DIRECT (right?). I shouldn't need a disk array to run a database. > Also note that most users will not have deep queuing for most things. To > get good random write performance with write through caching and NCQ, > you naturally need to be able to fill the drive queue most of the time. > Most desktop workloads don't come close to that, so the user will > definitely see it as slower. > Most desktop workloads use writeback cache, so write performance is not critical. However I'd hate to see my data destroyed by a power failure, and today's large caches can hold a bunch of data. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 18:23 ` Avi Kivity @ 2008-10-19 19:17 ` M. Warner Losh 2008-10-19 19:31 ` Avi Kivity 0 siblings, 1 reply; 101+ messages in thread From: M. Warner Losh @ 2008-10-19 19:17 UTC (permalink / raw) To: qemu-devel, avi; +Cc: chrisw, markmc, kvm-devel, Laurent.Vivier, ryanh In message: <48FB7B26.2090903@redhat.com> Avi Kivity <avi@redhat.com> writes: : >> Sounds like a bug. Shouldn't Linux disable the write cache unless the : >> user explicitly enables it, if NCQ is available? NCQ should provide : >> acceptable throughput even without the write cache. : >> : > : > How can it be a bug? : : If it puts my data at risk, it's a bug. I can understand it for IDE, : but not for SATA with NCQ. So wouldn't async mounts by default be a bug too? Warner ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 19:17 ` M. Warner Losh @ 2008-10-19 19:31 ` Avi Kivity 0 siblings, 0 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-19 19:31 UTC (permalink / raw) To: M. Warner Losh Cc: chrisw, markmc, kvm-devel, Laurent.Vivier, qemu-devel, ryanh M. Warner Losh wrote: > In message: <48FB7B26.2090903@redhat.com> > Avi Kivity <avi@redhat.com> writes: > : >> Sounds like a bug. Shouldn't Linux disable the write cache unless the > : >> user explicitly enables it, if NCQ is available? NCQ should provide > : >> acceptable throughput even without the write cache. > : >> > : > > : > How can it be a bug? > : > : If it puts my data at risk, it's a bug. I can understand it for IDE, > : but not for SATA with NCQ. > > So wouldn't async mounts by default be a bug too? > No. Applications which are worried about data integrity use fsync() or backups to protect the user. I'm not worried about losing a few minutes of openoffice.org work. I'm worried about mail systems, filesystem metadata, etc. which can easily lose a large amount of data which is hard to recover. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 101+ messages in thread
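A sketch of the idiom Avi is alluding to: an application that cares about a piece of data writes it to a temporary file, fsync()s it, rename()s it into place, and then fsync()s the directory so the rename itself is durable. The file names here are purely illustrative.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int commit_file(const char *dir, const char *tmp, const char *final,
                       const void *data, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }
    if (fsync(fd) < 0) { close(fd); return -1; }   /* data + inode on stable storage */
    close(fd);

    if (rename(tmp, final) < 0) return -1;

    int dfd = open(dir, O_RDONLY);                 /* make the rename durable */
    if (dfd < 0) return -1;
    int ret = fsync(dfd);
    close(dfd);
    return ret;
}

int main(void)
{
    const char msg[] = "queued mail message\n";
    return commit_file(".", "msg.tmp", "msg", msg, strlen(msg)) ? 1 : 0;
}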
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 18:10 ` Jens Axboe 2008-10-19 18:23 ` Avi Kivity @ 2008-10-19 18:24 ` Avi Kivity 2008-10-19 18:36 ` Jens Axboe 1 sibling, 1 reply; 101+ messages in thread From: Avi Kivity @ 2008-10-19 18:24 UTC (permalink / raw) To: Jens Axboe Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Jens Axboe wrote: >> Sounds like a bug. Shouldn't Linux disable the write cache unless the >> user explicitly enables it, if NCQ is available? NCQ should provide >> acceptable throughput even without the write cache. >> > > How can it be a bug? If it puts my data at risk, it's a bug. I can understand it for IDE, but not for SATA with NCQ. > Changing the cache policy of a drive would be a > policy decision in the kernel, If you don't want this in the kernel, then the system as a whole should default to being safe. Though in this case I think it is worthwhile to do this in the kernel. > that is never the right thing to do. > There's no such thing as 'acceptable throughput', I meant that performance is not completely destroyed. How can you even compare data safety to some percent of performance? > manufacturers and > customers usually just want the go faster stripes and data consistency > is second. What is the performance impact of disabling the write cache, given enough queue depth? > Additionally, write back caching is perfectly safe, if used > with a barrier enabled file system in Linux. > Not all Linux filesystems are barrier enabled, AFAIK. Further, barriers don't help with O_DIRECT (right?). I shouldn't need a disk array to run a database. > Also note that most users will not have deep queuing for most things. To > get good random write performance with write through caching and NCQ, > you naturally need to be able to fill the drive queue most of the time. > Most desktop workloads don't come close to that, so the user will > definitely see it as slower. > Most desktop workloads use writeback cache, so write performance is not critical. However I'd hate to see my data destroyed by a power failure, and today's large caches can hold a bunch of data. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 18:24 ` Avi Kivity @ 2008-10-19 18:36 ` Jens Axboe 2008-10-19 19:11 ` Avi Kivity 0 siblings, 1 reply; 101+ messages in thread From: Jens Axboe @ 2008-10-19 18:36 UTC (permalink / raw) To: Avi Kivity Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper On Sun, Oct 19 2008, Avi Kivity wrote: > Jens Axboe wrote: > > > > >> Sounds like a bug. Shouldn't Linux disable the write cache unless the > >> user explicitly enables it, if NCQ is available? NCQ should provide > >> acceptable throughput even without the write cache. > >> > > > > How can it be a bug? > > If it puts my data at risk, it's a bug. I can understand it for IDE, > but not for SATA with NCQ. Then YOU turn it off. Other people would consider the lousy performance to be the bigger problem. See policy :-) > > Changing the cache policy of a drive would be a > > policy decision in the kernel, > > If you don't want this in the kernel, then the system as a whole should > default to being safe. Though in this case I think it is worthwhile to > do this in the kernel. Doesn't matter how you turn this, it's still a policy decision. Leave it to the user. It's not exactly a new turn of events, commodity drives have shipped with write caching on forever. What if the drive has a battery backing? What if the user has an UPS? > > that is never the right thing to do. > > There's no such thing as 'acceptable throughput', > > I meant that performance is not completely destroyed. How can you even How do you know it's not destroyed? Depending on your workload, it may very well be dropping your throughput by orders of magnitude. > compare data safety to some percent of performance? I'm not, what I'm saying is that different people will have different opponions on what is most important. Do note that the window of corruption is really small and requires powerloss to trigger. So for most desktop users, the tradeoff is actually sane. > > manufacturers and > > customers usually just want the go faster stripes and data consistency > > is second. > > What is the performance impact of disabling the write cache, given > enough queue depth? Depends on the drive. On commodity drives, manufacturers don't really optimize much for the write through caching, since it's not really what anybody uses. So you'd have to benchmark it to see. > > Additionally, write back caching is perfectly safe, if used > > with a barrier enabled file system in Linux. > > > > Not all Linux filesystems are barrier enabled, AFAIK. Further, barriers > don't help with O_DIRECT (right?). O_DIRECT should just use FUA writes, there are safe with write back caching. I'm actually testing such a change just to gauge the performance impact. > I shouldn't need a disk array to run a database. You are free to turn off write back caching! > > Also note that most users will not have deep queuing for most things. To > > get good random write performance with write through caching and NCQ, > > you naturally need to be able to fill the drive queue most of the time. > > Most desktop workloads don't come close to that, so the user will > > definitely see it as slower. > > > > Most desktop workloads use writeback cache, so write performance is not > critical. Ehm, how do you reach that conclusion based on that statement? > However I'd hate to see my data destroyed by a power failure, and > today's large caches can hold a bunch of data. Then you use barriers or turn write back caching off, simple as that. 
-- Jens Axboe ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 18:36 ` Jens Axboe @ 2008-10-19 19:11 ` Avi Kivity 2008-10-19 19:30 ` Jens Axboe 0 siblings, 1 reply; 101+ messages in thread From: Avi Kivity @ 2008-10-19 19:11 UTC (permalink / raw) To: Jens Axboe Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Jens Axboe wrote: > On Sun, Oct 19 2008, Avi Kivity wrote: > >> Jens Axboe wrote: >> >> >> >> >>>> Sounds like a bug. Shouldn't Linux disable the write cache unless the >>>> user explicitly enables it, if NCQ is available? NCQ should provide >>>> acceptable throughput even without the write cache. >>>> >>>> >>> How can it be a bug? >>> >> If it puts my data at risk, it's a bug. I can understand it for IDE, >> but not for SATA with NCQ. >> > > Then YOU turn it off. Other people would consider the lousy performance > to be the bigger problem. See policy :-) > > If I get lousy performance, I can turn on the write cache and ignore the risk of data loss. If I lose my data, I can't turn off the write cache and get my data back. (it seems I can't turn off the write cache even without losing my data: [avi@firebolt ~]$ sudo sdparm --set=WCE=0 /dev/sd[ab] /dev/sda: ATA WDC WD3200YS-01P 21.0 change_mode_page: failed setting page: Caching (SBC) /dev/sdb: ATA WDC WD3200YS-01P 21.0 change_mode_page: failed setting page: Caching (SBC) ) >>> Changing the cache policy of a drive would be a >>> policy decision in the kernel, >>> >> If you don't want this in the kernel, then the system as a whole should >> default to being safe. Though in this case I think it is worthwhile to >> do this in the kernel. >> > > Doesn't matter how you turn this, it's still a policy decision. Leave it > to the user. It's not exactly a new turn of events, commodity drives > have shipped with write caching on forever. What if the drive has a > battery backing? If the drive has a batter backup, I'd argue it should report it as a write-through cache. I'm not a drive manufacturer though. > What if the user has an UPS? > > They should enable the write-back cache if they trust the UPS. Or maybe the system should do that automatically if it's aware of the UPS. "Policy" doesn't mean you shouldn't choose good defaults. >>> that is never the right thing to do. >>> There's no such thing as 'acceptable throughput', >>> >> I meant that performance is not completely destroyed. How can you even >> > > How do you know it's not destroyed? Depending on your workload, it may > very well be dropping your throughput by orders of magnitude. > > I guess this is the crux. According to my understanding, you shouldn't see such a horrible drop, unless the application does synchronous writes explicitly, in which case it is probably worried about data safety. >> compare data safety to some percent of performance? >> > > I'm not, what I'm saying is that different people will have different > opponions on what is most important. Do note that the window of > corruption is really small and requires powerloss to trigger. So for > most desktop users, the tradeoff is actually sane. > > I agree that the window is very small, and that by eliminating software failures we get rid of the major source of data loss. What I don't know is what the performance tradeoff looks like (and I can't measure since my drives won't let me turn off the cache for some reason). >>> Additionally, write back caching is perfectly safe, if used >>> with a barrier enabled file system in Linux. 
>>> >>> >> Not all Linux filesystems are barrier enabled, AFAIK. Further, barriers >> don't help with O_DIRECT (right?). >> > > O_DIRECT should just use FUA writes, there are safe with write back > caching. I'm actually testing such a change just to gauge the > performance impact. > You mean, this is not in mainline yet? So, with this, plus barrier support for metadata and O_SYNC writes, the write-back cache should be safe? Some googling shows that Windows XP introduced FUA for O_DIRECT and metadata writes as well. > >> I shouldn't need a disk array to run a database. >> > > You are free to turn off write back caching! > > What about the users who aren't on qemu-devel? However, with your FUA change, they should be safe. >> >> Most desktop workloads use writeback cache, so write performance is not >> critical. >> > > Ehm, how do you reach that conclusion based on that statement? > > Any write latency is buffered by the kernel. Write speed is main memory speed. Disk speed only bubbles up when memory is tight. >> However I'd hate to see my data destroyed by a power failure, and >> today's large caches can hold a bunch of data. >> > > Then you use barriers or turn write back caching off, simple as that. > I will (if I figure out how) but there may be one or two users who haven't read the scsi spec yet. Or more correctly, I am revising my opinion of the write back cache since even when it is enabled, it is completely optional. Instead of disabling the write back cache we should use FUA and barriers, and since you are to be working on FUA, it looks like this will be resolved soon without performance/correctness compromises. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 19:11 ` Avi Kivity @ 2008-10-19 19:30 ` Jens Axboe 2008-10-19 20:16 ` Avi Kivity 2008-10-20 14:14 ` Avi Kivity 0 siblings, 2 replies; 101+ messages in thread From: Jens Axboe @ 2008-10-19 19:30 UTC (permalink / raw) To: Avi Kivity Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper On Sun, Oct 19 2008, Avi Kivity wrote: > Jens Axboe wrote: > > On Sun, Oct 19 2008, Avi Kivity wrote: > > > >> Jens Axboe wrote: > >> > >> > >> > >> > >>>> Sounds like a bug. Shouldn't Linux disable the write cache unless the > >>>> user explicitly enables it, if NCQ is available? NCQ should provide > >>>> acceptable throughput even without the write cache. > >>>> > >>>> > >>> How can it be a bug? > >>> > >> If it puts my data at risk, it's a bug. I can understand it for IDE, > >> but not for SATA with NCQ. > >> > > > > Then YOU turn it off. Other people would consider the lousy performance > > to be the bigger problem. See policy :-) > > > > > > If I get lousy performance, I can turn on the write cache and ignore the > risk of data loss. If I lose my data, I can't turn off the write cache > and get my data back. > > (it seems I can't turn off the write cache even without losing my data: > > [avi@firebolt ~]$ sudo sdparm --set=WCE=0 /dev/sd[ab] > /dev/sda: ATA WDC WD3200YS-01P 21.0 > change_mode_page: failed setting page: Caching (SBC) > /dev/sdb: ATA WDC WD3200YS-01P 21.0 > change_mode_page: failed setting page: Caching (SBC) Use hdparm, it's an ATA drive even if Linux currently uses the scsi layer for it. Or use sysfs, there's a "cache_type" attribute in the scsi disk sysfs directory. > >>> Changing the cache policy of a drive would be a > >>> policy decision in the kernel, > >>> > >> If you don't want this in the kernel, then the system as a whole should > >> default to being safe. Though in this case I think it is worthwhile to > >> do this in the kernel. > >> > > > > Doesn't matter how you turn this, it's still a policy decision. Leave it > > to the user. It's not exactly a new turn of events, commodity drives > > have shipped with write caching on forever. What if the drive has a > > battery backing? > > If the drive has a batter backup, I'd argue it should report it as a > write-through cache. I'm not a drive manufacturer though. You could argue that, but that could influence other decision making. FWIW, we've discussed this very issue for YEARS, reiterating the debate here isn't likely to change much... > > What if the user has an UPS? > > > > > > They should enable the write-back cache if they trust the UPS. Or maybe > the system should do that automatically if it's aware of the UPS. > > "Policy" doesn't mean you shouldn't choose good defaults. Changing the hardware settings for this kind of behaviour IS most certainly policy. > >>> that is never the right thing to do. > >>> There's no such thing as 'acceptable throughput', > >>> > >> I meant that performance is not completely destroyed. How can you even > >> > > > > How do you know it's not destroyed? Depending on your workload, it may > > very well be dropping your throughput by orders of magnitude. > > > > > > I guess this is the crux. According to my understanding, you shouldn't > see such a horrible drop, unless the application does synchronous writes > explicitly, in which case it is probably worried about data safety. Then you need to adjust your understanding, because you definitely will see a big drop in performance. 
> >> compare data safety to some percent of performance? > >> > > > > I'm not, what I'm saying is that different people will have different > > opponions on what is most important. Do note that the window of > > corruption is really small and requires powerloss to trigger. So for > > most desktop users, the tradeoff is actually sane. > > > > > > I agree that the window is very small, and that by eliminating software > failures we get rid of the major source of data loss. What I don't know > is what the performance tradeoff looks like (and I can't measure since > my drives won't let me turn off the cache for some reason). > > >>> Additionally, write back caching is perfectly safe, if used > >>> with a barrier enabled file system in Linux. > >>> > >>> > >> Not all Linux filesystems are barrier enabled, AFAIK. Further, barriers > >> don't help with O_DIRECT (right?). > >> > > > > O_DIRECT should just use FUA writes, there are safe with write back > > caching. I'm actually testing such a change just to gauge the > > performance impact. > > > > You mean, this is not in mainline yet? It isn't. > So, with this, plus barrier support for metadata and O_SYNC writes, the > write-back cache should be safe? Yes, and fsync() as well provided the fs does a flush there too. > Some googling shows that Windows XP introduced FUA for O_DIRECT and > metadata writes as well. There's a lot of other background information to understand to gauge the impact of using eg FUA for O_DIRECT in Linux as well. MS basically wrote the FUA for ATA proposal, and the original usage pattern (as far as I remember) was indeed meta data. Hence it also imposes a priority boost in most (all?) drive firmwares, since it's deemed important. So just using FUA vs non-FUA is likely to impact performance of other workloads in fairly unknown ways. FUA on non-queuing drives will also likely suck for performance, since you're basically going to be blowing a drive rev for each IO. And that hurts. > >> I shouldn't need a disk array to run a database. > >> > > > > You are free to turn off write back caching! > > > > > > What about the users who aren't on qemu-devel? It may be news to you, but it has been debated on lkml in the past as well. Not even that long ago, and I'd be surprised of lwn didn't run some article on it as well. But I agree it's important information, but realize that until just recently most people didn't really consider it a likely scenario in practice... I wrote and committed the original barrier implementation in Linux in 2001, and just this year XFS made it a default mount option. After the recent debacle on this on lkml, ext4 made it the default as well. So let me turn it around a bit - if this issue really did hit lots of people out there in real life, don't you think there would have been more noise about this and we would have made this the default years ago? So while we both agree it's a risk, it's not a huuuge risk... > However, with your FUA change, they should be safe. Yes, that would make O_DIRECT safe always. Except when it falls back to buffered IO, woops... > >> Most desktop workloads use writeback cache, so write performance is not > >> critical. > >> > > > > Ehm, how do you reach that conclusion based on that statement? > > > > > > Any write latency is buffered by the kernel. Write speed is main memory > speed. Disk speed only bubbles up when memory is tight. That's a nice theory, in practice that is completely wrong. You end up waiting on writes for LOTS of other reasons! 
> >> However I'd hate to see my data destroyed by a power failure, and > >> today's large caches can hold a bunch of data. > >> > > > > Then you use barriers or turn write back caching off, simple as that. > > > > I will (if I figure out how) but there may be one or two users who > haven't read the scsi spec yet. A newish hdparm should work, or the sysfs attribute. hdparm will pass-through the real ata command to do this, the sysfs approach (and sdparm) requires MODE_SENSE and MODE_SELECT transformation of that page. > Or more correctly, I am revising my opinion of the write back cache > since even when it is enabled, it is completely optional. Instead of > disabling the write back cache we should use FUA and barriers, and since > you are to be working on FUA, it looks like this will be resolved soon > without performance/correctness compromises. Lets see how the testing goes :-) Possibly just enabled FUA O_DIRECT with barriers, that'll likely be a good default. -- Jens Axboe ^ permalink raw reply [flat|nested] 101+ messages in thread
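For reference, a sketch of driving the sysfs "cache_type" attribute Jens mentions from a small C program. The /sys/class/scsi_disk/<h:c:t:l>/cache_type path layout is an assumption about the local system, and changing the policy requires root and a driver that accepts a MODE SELECT on the caching page.

#include <stdio.h>

int main(int argc, char **argv)
{
    /* e.g. argv[1] = "/sys/class/scsi_disk/0:0:0:0/cache_type",
       argv[2] = "write through" or "write back" (optional) */
    if (argc < 2) {
        fprintf(stderr, "usage: %s <cache_type sysfs path> [new policy]\n", argv[0]);
        return 1;
    }

    char cur[64] = "";
    FILE *f = fopen(argv[1], "r");
    if (!f || !fgets(cur, sizeof(cur), f)) {
        perror("read cache_type");
        return 1;
    }
    fclose(f);
    printf("current policy: %s", cur);   /* sysfs value already ends in a newline */

    if (argc > 2) {
        FILE *w = fopen(argv[1], "w");   /* needs root */
        if (!w || fprintf(w, "%s\n", argv[2]) < 0) {
            perror("write cache_type");
            return 1;
        }
        fclose(w);
    }
    return 0;
}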
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 19:30 ` Jens Axboe @ 2008-10-19 20:16 ` Avi Kivity 2008-10-20 14:14 ` Avi Kivity 1 sibling, 0 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-19 20:16 UTC (permalink / raw) To: Jens Axboe Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Jens Axboe wrote: >> (it seems I can't turn off the write cache even without losing my data: >> > Use hdparm, it's an ATA drive even if Linux currently uses the scsi > layer for it. Or use sysfs, there's a "cache_type" attribute in the scsi > disk sysfs directory. > Ok. It's moot anyway. >> "Policy" doesn't mean you shouldn't choose good defaults. >> > > Changing the hardware settings for this kind of behaviour IS most > certainly policy. > Leaving bad hardware settings is also policy. But in light of FUA, the SCSI write cache is not a bad thing, so we should definitely leave it on. >> I guess this is the crux. According to my understanding, you shouldn't >> see such a horrible drop, unless the application does synchronous writes >> explicitly, in which case it is probably worried about data safety. >> > > Then you need to adjust your understanding, because you definitely will > see a big drop in performance. > > Can you explain why? This is interesting. >>> O_DIRECT should just use FUA writes, there are safe with write back >>> caching. I'm actually testing such a change just to gauge the >>> performance impact. >>> >>> >> You mean, this is not in mainline yet? >> > > It isn't. > What is the time frame for this? 2.6.29? >> Some googling shows that Windows XP introduced FUA for O_DIRECT and >> metadata writes as well. >> > > There's a lot of other background information to understand to gauge the > impact of using eg FUA for O_DIRECT in Linux as well. MS basically wrote > the FUA for ATA proposal, and the original usage pattern (as far as I > remember) was indeed meta data. Hence it also imposes a priority boost > in most (all?) drive firmwares, since it's deemed important. So just > using FUA vs non-FUA is likely to impact performance of other workloads > in fairly unknown ways. FUA on non-queuing drives will also likely suck > for performance, since you're basically going to be blowing a drive rev > for each IO. And that hurts. > Let's assume queueing drives, since these are fairly common these days. So qemu issuing O_DIRECT which turns into FUA writes is safe but suboptimal. Has there been talk about exposing the difference between FUA writes and cached writes to userspace? What about barriers? With a rich enough userspace interface, qemu can communicate the intentions of the guest and not force the kernel to make a performance/correctness tradeoff. >> >> What about the users who aren't on qemu-devel? >> > > It may be news to you, but it has been debated on lkml in the past as > well. Not even that long ago, and I'd be surprised of lwn didn't run > some article on it as well. Let's postulate the existence of a user that doesn't read lkml or even lwn. > But I agree it's important information, but > realize that until just recently most people didn't really consider it a > likely scenario in practice... > > I wrote and committed the original barrier implementation in Linux in > 2001, and just this year XFS made it a default mount option. After the > recent debacle on this on lkml, ext4 made it the default as well. 
> > So let me turn it around a bit - if this issue really did hit lots of > people out there in real life, don't you think there would have been > more noise about this and we would have made this the default years ago? > So while we both agree it's a risk, it's not a huuuge risk... > I agree, not a huge risk. I guess compared to the rest of the suckiness involved (took a long while just to get journalling), this is really a minor issue. It's interesting though that Windows supported this in 2001, seven years ago, so at least they considered it important. I guess I'm sensitive to this because in my filesystemy past QA would jerk out data and power cables while running various tests and act surprised whenever data was lost. So I'm allergic to data loss. With qemu (at least when used with a hypervisor) we have to be extra safe since we have no idea what workload is running and how critical data safety is. Well, we have hints (whether FUA is set or not) when using SCSI, but right now we don't have a way of communicating these hints to the kernel. One important takeaway is to find out whether virtio-blk supports FUA, and if not, add it. >> However, with your FUA change, they should be safe. >> > > Yes, that would make O_DIRECT safe always. Except when it falls back to > buffered IO, woops... > > Woops. >> Any write latency is buffered by the kernel. Write speed is main memory >> speed. Disk speed only bubbles up when memory is tight. >> > > That's a nice theory, in practice that is completely wrong. You end up > waiting on writes for LOTS of other reasons! > > Journal commits? Can you elaborate? In the filesystem I worked on, one would never wait on a write to disk unless memory was full. Even synchronous writes were serviced immediately, since the system had a battery-backed replicated cache. I guess the situation with Linux filesystems is different. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-19 19:30 ` Jens Axboe 2008-10-19 20:16 ` Avi Kivity @ 2008-10-20 14:14 ` Avi Kivity 1 sibling, 0 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-20 14:14 UTC (permalink / raw) To: Jens Axboe Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper Jens Axboe wrote: > Use hdparm, it's an ATA drive even if Linux currently uses the scsi > layer for it. Or use sysfs, there's a "cache_type" attribute in the scsi > disk sysfs directory. > With this, I was able to benchmark the write cache for 4k and 128k random access loads. Numbers in iops, hope it doesn't get mangled:

                 4k blocks              128k blocks
  pattern   cache off   cache on   cache off   cache on
  read         103         101        74          71
  write         86         149        72          91
  rw            87          89        63          65

Test was run on a 90G logical volume of a 250G laptop disk; using O_DIRECT and libaio. Pure write workloads see a tremendous benefit, likely because the heads can do a linear scan of the disk. An 8MB cache translates to 2000 objects, likely around 1000 per pass. Increasing the block size reduces the performance boost, as expected. read/write workloads do not benefit at all (or maybe a bit); presumably the head movement is governed by reads alone. Of course, this tests only the disk subsystem; in particular, if some workload is sensitive to write latencies, the write cache can reduce those in a mixed read/write load, as long as the cache is not flooded (so loads with a lower percentage of writes would benefit more). -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. ^ permalink raw reply [flat|nested] 101+ messages in thread
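For anyone wanting to reproduce numbers in this spirit, a minimal sketch of a random-write IOPS test using the same ingredients (O_DIRECT plus libaio, link with -laio), with a queue depth of one for simplicity; the block size, request count and device span are arbitrary choices, and the program overwrites whatever scratch device it is pointed at.

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLOCK 4096
#define NREQS 2000
#define SPAN  (1024LL * 1024 * 1024)   /* test over the first 1G of the device */

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <scratch device>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_WRONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, BLOCK, BLOCK)) return 1;

    io_context_t ctx = 0;
    if (io_setup(1, &ctx) < 0) { fprintf(stderr, "io_setup failed\n"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int i = 0; i < NREQS; i++) {
        long long off = (rand() % (SPAN / BLOCK)) * (long long)BLOCK;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;

        io_prep_pwrite(&cb, fd, buf, BLOCK, off);            /* queue one write */
        if (io_submit(ctx, 1, cbs) != 1) { fprintf(stderr, "io_submit failed\n"); return 1; }
        if (io_getevents(ctx, 1, 1, &ev, NULL) != 1) {       /* wait for it */
            fprintf(stderr, "io_getevents failed\n"); return 1;
        }
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.0f random %d-byte writes/sec\n", NREQS / secs, BLOCK);

    io_destroy(ctx);
    close(fd);
    return 0;
}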
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-09 17:00 [Qemu-devel] [RFC] Disk integrity in QEMU Anthony Liguori ` (2 preceding siblings ...) 2008-10-10 9:16 ` Avi Kivity @ 2008-10-10 10:03 ` Fabrice Bellard 2008-10-13 16:11 ` Laurent Vivier ` (3 subsequent siblings) 7 siblings, 0 replies; 101+ messages in thread From: Fabrice Bellard @ 2008-10-10 10:03 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Anthony Liguori wrote: > [...] > So to summarize, I think we should enable O_DSYNC by default to ensure > that guest data integrity is not dependent on the host OS, and that > practically speaking, cache=off is only useful for very specialized > circumstances. Part of the patch I'll follow up with includes changes > to the man page to document all of this for users. > > Thoughts? QEMU is also used for debugging and arbitrary machine simulation. In this case, using uncached accesses is bad because you want maximum isolation between the guest and the host. For example, if the guest is a development OS not using a disk cache, I still want to use the host disk cache. So the "normal" caching scheme must be left as it is now. However, I agree that the default behavior could be modified. Regards, Fabrice. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-09 17:00 [Qemu-devel] [RFC] Disk integrity in QEMU Anthony Liguori ` (3 preceding siblings ...) 2008-10-10 10:03 ` Fabrice Bellard @ 2008-10-13 16:11 ` Laurent Vivier 2008-10-13 16:58 ` Anthony Liguori 2008-10-13 17:06 ` [Qemu-devel] " Ryan Harper ` (2 subsequent siblings) 7 siblings, 1 reply; 101+ messages in thread From: Laurent Vivier @ 2008-10-13 16:11 UTC (permalink / raw) To: qemu-devel; +Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel Le jeudi 09 octobre 2008 à 12:00 -0500, Anthony Liguori a écrit : [...] > So to summarize, I think we should enable O_DSYNC by default to > ensure > that guest data integrity is not dependent on the host OS, and that > practically speaking, cache=off is only useful for very specialized > circumstances. Part of the patch I'll follow up with includes > changes > to the man page to document all of this for users. perhaps I'm wrong but I think O_DSYNC (in fact O_SYNC for linux) will impact host filesystem performance, at least with ext3, because the synchronicity is done through the commit of the journal of the whole filesystem: see fs/ext3/file.c:ext3_file_write() (I've removed the comments here):

	...
	if (file->f_flags & O_SYNC) {
		if (!ext3_should_journal_data(inode))
			return ret;
		goto force_commit;
	}

	if (!IS_SYNC(inode))
		return ret;

force_commit:
	err = ext3_force_commit(inode->i_sb);
	if (err)
		return err;
	return ret;
}

Moreover, the real behavior depends on the type of the journaling system you use... Regards, Laurent -- ----------------- Laurent.Vivier@bull.net ------------------ "La perfection est atteinte non quand il ne reste rien à ajouter mais quand il ne reste rien à enlever." Saint Exupéry ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 16:11 ` Laurent Vivier @ 2008-10-13 16:58 ` Anthony Liguori 2008-10-13 17:36 ` Jamie Lokier 0 siblings, 1 reply; 101+ messages in thread From: Anthony Liguori @ 2008-10-13 16:58 UTC (permalink / raw) To: qemu-devel; +Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel Laurent Vivier wrote: > Le jeudi 09 octobre 2008 à 12:00 -0500, Anthony Liguori a écrit : > [...] > >> So to summarize, I think we should enable O_DSYNC by default to >> ensure >> that guest data integrity is not dependent on the host OS, and that >> practically speaking, cache=off is only useful for very specialized >> circumstances. Part of the patch I'll follow up with includes >> changes >> to the man page to document all of this for users. >> > > perhaps I'm wrong but I think O_DSYNC (in fact O_SYNC for linux) will > impact host filesystem performance, at least with ext3, because the > synchronicity is done through the commit of the journal of the whole > filesystem: > Yes, but this is important because if the journal isn't committed, then it's possible that while the data would be on disk, the file system metadata is out of sync on disk which could result in the changes to the file being lost. I think that you are in fact correct that the journal write is probably unnecessary overhead in a lot of scenarios but Ryan actually has some performance data that he should be posting soon that shows that in most circumstances, O_DSYNC does pretty well compared to O_DIRECT for write so I don't think this is a practical concern. Regards, Anthony Liguori > see fs/ext3/file.c:ext3_file_write() (I've removed the comments here):
>
>	...
>	if (file->f_flags & O_SYNC) {
>		if (!ext3_should_journal_data(inode))
>			return ret;
>		goto force_commit;
>	}
>
>	if (!IS_SYNC(inode))
>		return ret;
>
> force_commit:
>	err = ext3_force_commit(inode->i_sb);
>	if (err)
>		return err;
>	return ret;
> }
>
> Moreover, the real behavior depends on the type of the journaling system > you use... > > Regards, > Laurent > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 16:58 ` Anthony Liguori @ 2008-10-13 17:36 ` Jamie Lokier 0 siblings, 0 replies; 101+ messages in thread From: Jamie Lokier @ 2008-10-13 17:36 UTC (permalink / raw) To: qemu-devel; +Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel Anthony Liguori wrote: > >perhaps I'm wrong but I think O_DSYNC (in fact O_SYNC for linux) will > >impact host filesystem performance, at least with ext3, because the > >synchronicity is done through the commit of the journal of the whole > >filesystem: > > > > Yes, but this is important because if the journal isn't committed, then > it's possible that while the data would be on disk, the file system > metadata is out of sync on disk which could result in the changes to the > file being lost. > > I think that you are in fact correct that the journal write is probably > unnecessary overhead in a lot of scenarios but Ryan actually has some > performance data that he should be posting soon that shows that in most > circumstances, O_DSYNC does pretty well compared to O_DIRECT for write > so I don't this is a practical concern. fsync on ext3 is whacky anyway. I haven't checked what the _real_ semantics of O_DSYNC are for ext3, but I would be surprised if it's less whacky than fsync. Sometimes ext3 fsync takes a very long time, because it's waiting for lots of dirty data from other processes to be written. (Firefox 3 was bitten by this - it made Firefox stall repeatedly for up to half a minute for some users.) Sometimes ext3 fsync doesn't write all the dirty pages of a file - there are some recent kernel patches exploring ways to fix this. Sometimes ext3 fsync doesn't flush the disk's write cache after writing data, despite barriers being requested, if only dirty data blocks are written and there is no inode change. -- Jamie ^ permalink raw reply [flat|nested] 101+ messages in thread
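The stalls Jamie describes are easy to observe with a sketch like the following, which appends a small record and times each fsync(); on ext3 in ordered mode the call can block behind unrelated dirty data being written out by the journal commit. The file path and iteration count are arbitrary.

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) return 1;

    int fd = open(argv[1], O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char rec[] = "small appended record\n";
    struct timespec t0, t1;

    for (int i = 0; i < 10; i++) {
        if (write(fd, rec, sizeof(rec) - 1) < 0) perror("write");

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (fsync(fd) < 0) perror("fsync");     /* may stall behind other writers */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("fsync %d took %.1f ms\n", i, ms);
        sleep(1);
    }

    close(fd);
    return 0;
}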
* [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-09 17:00 [Qemu-devel] [RFC] Disk integrity in QEMU Anthony Liguori ` (4 preceding siblings ...) 2008-10-13 16:11 ` Laurent Vivier @ 2008-10-13 17:06 ` Ryan Harper 2008-10-13 18:43 ` Anthony Liguori ` (2 more replies) 2008-10-13 17:58 ` [Qemu-devel] " Rik van Riel 2008-10-28 17:34 ` Ian Jackson 7 siblings, 3 replies; 101+ messages in thread From: Ryan Harper @ 2008-10-13 17:06 UTC (permalink / raw) To: Anthony Liguori Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel@nongnu.org, Ryan Harper * Anthony Liguori <anthony@codemonkey.ws> [2008-10-09 12:00]: > Read performance should be unaffected by using O_DSYNC. O_DIRECT will > significantly reduce read performance. I think we should use O_DSYNC by > default and I have sent out a patch that contains that. We will follow > up with benchmarks to demonstrate this. baremetal baseline (1g dataset): ---------------------------+-------+-------+--------------+------------+ Test scenarios | bandw | % CPU | ave submit | ave compl | type, block size, iface | MB/s | usage | latency usec | latency ms | ---------------------------+-------+-------+--------------+------------+ write, 16k, lvm, direct=1 | 127.7 | 12 | 11.66 | 9.48 | write, 64k, lvm, direct=1 | 178.4 | 5 | 13.65 | 27.15 | write, 1M, lvm, direct=1 | 186.0 | 3 | 163.75 | 416.91 | ---------------------------+-------+-------+--------------+------------+ read , 16k, lvm, direct=1 | 170.4 | 15 | 10.86 | 7.10 | read , 64k, lvm, direct=1 | 199.2 | 5 | 12.52 | 24.31 | read , 1M, lvm, direct=1 | 202.0 | 3 | 133.74 | 382.67 | ---------------------------+-------+-------+--------------+------------+ kvm write (1g dataset): ---------------------------+-------+-------+--------------+------------+ Test scenarios | bandw | % CPU | ave submit | ave compl | block size,iface,cache,sync| MB/s | usage | latency usec | latency ms | ---------------------------+-------+-------+--------------+------------+ 16k,virtio,off,none | 135.0 | 94 | 9.1 | 8.71 | 16k,virtio,on ,none | 184.0 | 100 | 63.69 | 63.48 | 16k,virtio,on ,O_DSYNC | 150.0 | 35 | 6.63 | 8.31 | ---------------------------+-------+-------+--------------+------------+ 64k,virtio,off,none | 169.0 | 51 | 17.10 | 28.00 | 64k,virtio,on ,none | 189.0 | 60 | 69.42 | 24.92 | 64k,virtio,on ,O_DSYNC | 171.0 | 48 | 18.83 | 27.72 | ---------------------------+-------+-------+--------------+------------+ 1M ,virtio,off,none | 142.0 | 30 | 7176.00 | 523.00 | 1M ,virtio,on ,none | 190.0 | 45 | 5332.63 | 392.35 | 1M ,virtio,on ,O_DSYNC | 164.0 | 39 | 6444.48 | 471.20 | ---------------------------+-------+-------+--------------+------------+ kvm read (1g dataset): ---------------------------+-------+-------+--------------+------------+ Test scenarios | bandw | % CPU | ave submit | ave compl | block size,iface,cache,sync| MB/s | usage | latency usec | latency ms | ---------------------------+-------+-------+--------------+------------+ 16k,virtio,off,none | 175.0 | 40 | 22.42 | 6.71 | 16k,virtio,on ,none | 211.0 | 147 | 59.49 | 5.54 | 16k,virtio,on ,O_DSYNC | 212.0 | 145 | 60.45 | 5.47 | ---------------------------+-------+-------+--------------+------------+ 64k,virtio,off,none | 190.0 | 64 | 16.31 | 24.92 | 64k,virtio,on ,none | 546.0 | 161 | 111.06 | 8.54 | 64k,virtio,on ,O_DSYNC | 520.0 | 151 | 116.66 | 8.97 | ---------------------------+-------+-------+--------------+------------+ 1M ,virtio,off,none | 182.0 | 32 | 5573.44 | 407.21 | 1M ,virtio,on ,none | 750.0 | 127 | 1344.65 | 96.42 | 1M 
,virtio,on ,O_DSYNC | 768.0 | 123 | 1289.05 | 94.25 | ---------------------------+-------+-------+--------------+------------+ -------------------------------------------------------------------------- exporting file in ext3 filesystem as block device (1g) -------------------------------------------------------------------------- kvm write (1g dataset): ---------------------------+-------+-------+--------------+------------+ Test scenarios | bandw | % CPU | ave submit | ave compl | block size,iface,cache,sync| MB/s | usage | latency usec | latency ms | ---------------------------+-------+-------+--------------+------------+ 16k,virtio,off,none | 12.1 | 15 | 9.1 | 8.71 | 16k,virtio,on ,none | 192.0 | 52 | 62.52 | 6.17 | 16k,virtio,on ,O_DSYNC | 142.0 | 59 | 18.81 | 8.29 | ---------------------------+-------+-------+--------------+------------+ 64k,virtio,off,none | 15.5 | 8 | 21.10 | 311.00 | 64k,virtio,on ,none | 454.0 | 130 | 113.25 | 10.65 | 64k,virtio,on ,O_DSYNC | 154.0 | 48 | 20.25 | 30.75 | ---------------------------+-------+-------+--------------+------------+ 1M ,virtio,off,none | 24.7 | 5 | 41736.22 | 3020.08 | 1M ,virtio,on ,none | 485.0 | 100 | 2052.09 | 149.81 | 1M ,virtio,on ,O_DSYNC | 161.0 | 42 | 6268.84 | 453.84 | ---------------------------+-------+-------+--------------+------------+ -- Ryan Harper Software Engineer; Linux Technology Center IBM Corp., Austin, Tx (512) 838-9253 T/L: 678-9253 ryanh@us.ibm.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 17:06 ` [Qemu-devel] " Ryan Harper @ 2008-10-13 18:43 ` Anthony Liguori 2008-10-14 16:42 ` Avi Kivity 2008-10-13 18:51 ` Laurent Vivier 2008-10-13 19:00 ` Mark Wagner 2 siblings, 1 reply; 101+ messages in thread From: Anthony Liguori @ 2008-10-13 18:43 UTC (permalink / raw) To: Ryan Harper Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel@nongnu.org Ryan Harper wrote: > * Anthony Liguori <anthony@codemonkey.ws> [2008-10-09 12:00]: > >> Read performance should be unaffected by using O_DSYNC. O_DIRECT will >> significantly reduce read performance. I think we should use O_DSYNC by >> default and I have sent out a patch that contains that. We will follow >> up with benchmarks to demonstrate this. >> >> With 16k writes I think we hit a pathological case with the particular storage backend we're using since it has many disks and the volume is striped. Also, the results are a bit different when going through a file system versus an LVM partition (the latter being the first data set). Presumably, this is because even with no flags, writes happen synchronously to an LVM partition. Also, cache=off seems to do pretty terribly when operating on an ext3 file. I suspect this has to do with how ext3 implements O_DIRECT. However, the data demonstrates pretty nicely that O_DSYNC gives you native write speed plus accelerated read speed, which I think we agree is the desirable behavior. cache=off never seems to outperform cache=wt, which is another good argument for cache=wt being the default over cache=off. Regards, Anthony Liguori ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 18:43 ` Anthony Liguori @ 2008-10-14 16:42 ` Avi Kivity 0 siblings, 0 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-14 16:42 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Anthony Liguori wrote: > > With 16k writes I think we hit a pathological case with the particular > storage backend we're using since it has many disks and the volume is > striped. Also the results a bit different when going through a file > system verses a LVM partition (the later being the first data set). > Presumably, this is because even with no flags, writes happen > synchronously to a LVM partition. > With no flags, writes should hit the buffer cache (which is the page cache's name when used to cache block devices). > Also, cache=off seems to do pretty terribly when operating on an ext3 > file. I suspect this has to do with how ext3 implements O_DIRECT. Is the file horribly fragmented? Otherwise ext3 O_DIRECT should be quite good. Maybe the mapping is not in the host cache and has to be brought in. > > However, the data demonstrates pretty nicely that O_DSYNC gives you > native write speed, but accelerated read speed which I think we agree > is the desirable behavior. cache=off never seems to outperform > cache=wt which is another good argument for it being the default over > cache=off. Without copyless block I/O, there's no reason to expect cache=none to outperform cache=writethrough. I expect the read performance to evaporate with a random access pattern over a large disk (or even sequential access, given enough running time). -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
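A quick way to see Avi's point about where unflagged writes land is to watch the host's dirty-page counters while the guest write test runs; this is only a rough sanity check, not a substitute for the benchmark numbers:
    # on the host, while the guest runs its write workload
    watch -n1 'grep -E "^(Dirty|Writeback):" /proc/meminfo'
With cache=on and no sync flag, Dirty should climb toward the size of the dataset; with O_DSYNC or O_DIRECT it should stay near zero.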
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 17:06 ` [Qemu-devel] " Ryan Harper 2008-10-13 18:43 ` Anthony Liguori @ 2008-10-13 18:51 ` Laurent Vivier 2008-10-13 19:43 ` Ryan Harper 2008-10-13 19:00 ` Mark Wagner 2 siblings, 1 reply; 101+ messages in thread From: Laurent Vivier @ 2008-10-13 18:51 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper [-- Attachment #1: Type: text/plain, Size: 6855 bytes --] Le 13 oct. 08 à 19:06, Ryan Harper a écrit : > * Anthony Liguori <anthony@codemonkey.ws> [2008-10-09 12:00]: >> Read performance should be unaffected by using O_DSYNC. O_DIRECT >> will >> significantly reduce read performance. I think we should use >> O_DSYNC by >> default and I have sent out a patch that contains that. We will >> follow >> up with benchmarks to demonstrate this. > Hi Ryan, as "cache=on" implies a factor (memory) shared by the whole system, you must take into account the size of the host memory and run some applications (several guests ?) to pollute the host cache, for instance you can run 4 guest and run bench in each of them concurrently, and you could reasonably limits the size of the host memory to 5 x the size of the guest memory. (for instance 4 guests with 128 MB on a host with 768 MB). as O_DSYNC implies journal commit, you should run a bench on the ext3 host file system concurrently to the bench on a guest to see the impact of the commit on each bench. > > baremetal baseline (1g dataset): > ---------------------------+-------+-------+-------------- > +------------+ > Test scenarios | bandw | % CPU | ave submit | ave > compl | > type, block size, iface | MB/s | usage | latency usec | latency > ms | > ---------------------------+-------+-------+-------------- > +------------+ > write, 16k, lvm, direct=1 | 127.7 | 12 | 11.66 | > 9.48 | > write, 64k, lvm, direct=1 | 178.4 | 5 | 13.65 | > 27.15 | > write, 1M, lvm, direct=1 | 186.0 | 3 | 163.75 | > 416.91 | > ---------------------------+-------+-------+-------------- > +------------+ > read , 16k, lvm, direct=1 | 170.4 | 15 | 10.86 | > 7.10 | > read , 64k, lvm, direct=1 | 199.2 | 5 | 12.52 | > 24.31 | > read , 1M, lvm, direct=1 | 202.0 | 3 | 133.74 | > 382.67 | > ---------------------------+-------+-------+-------------- > +------------+ > Could you recall which benchmark you use ? > kvm write (1g dataset): > ---------------------------+-------+-------+-------------- > +------------+ > Test scenarios | bandw | % CPU | ave submit | ave > compl | > block size,iface,cache,sync| MB/s | usage | latency usec | latency > ms | > ---------------------------+-------+-------+-------------- > +------------+ > 16k,virtio,off,none | 135.0 | 94 | 9.1 | > 8.71 | > 16k,virtio,on ,none | 184.0 | 100 | 63.69 | > 63.48 | > 16k,virtio,on ,O_DSYNC | 150.0 | 35 | 6.63 | > 8.31 | > ---------------------------+-------+-------+-------------- > +------------+ > 64k,virtio,off,none | 169.0 | 51 | 17.10 | > 28.00 | > 64k,virtio,on ,none | 189.0 | 60 | 69.42 | > 24.92 | > 64k,virtio,on ,O_DSYNC | 171.0 | 48 | 18.83 | > 27.72 | > ---------------------------+-------+-------+-------------- > +------------+ > 1M ,virtio,off,none | 142.0 | 30 | 7176.00 | > 523.00 | > 1M ,virtio,on ,none | 190.0 | 45 | 5332.63 | > 392.35 | > 1M ,virtio,on ,O_DSYNC | 164.0 | 39 | 6444.48 | > 471.20 | > ---------------------------+-------+-------+-------------- > +------------+ According to the semantic, I don't understand how O_DSYNC can be better than cache=off in this case... 
> > kvm read (1g dataset): > ---------------------------+-------+-------+-------------- > +------------+ > Test scenarios | bandw | % CPU | ave submit | ave > compl | > block size,iface,cache,sync| MB/s | usage | latency usec | latency > ms | > ---------------------------+-------+-------+-------------- > +------------+ > 16k,virtio,off,none | 175.0 | 40 | 22.42 | > 6.71 | > 16k,virtio,on ,none | 211.0 | 147 | 59.49 | > 5.54 | > 16k,virtio,on ,O_DSYNC | 212.0 | 145 | 60.45 | > 5.47 | > ---------------------------+-------+-------+-------------- > +------------+ > 64k,virtio,off,none | 190.0 | 64 | 16.31 | > 24.92 | > 64k,virtio,on ,none | 546.0 | 161 | 111.06 | > 8.54 | > 64k,virtio,on ,O_DSYNC | 520.0 | 151 | 116.66 | > 8.97 | > ---------------------------+-------+-------+-------------- > +------------+ > 1M ,virtio,off,none | 182.0 | 32 | 5573.44 | > 407.21 | > 1M ,virtio,on ,none | 750.0 | 127 | 1344.65 | > 96.42 | > 1M ,virtio,on ,O_DSYNC | 768.0 | 123 | 1289.05 | > 94.25 | > ---------------------------+-------+-------+-------------- > +------------+ OK, but in this case the size of the cache for "cache=off" is the size of the guest cache whereas in the other cases the size of the cache is the size of the guest cache + the size of the host cache, this is not fair... > > -------------------------------------------------------------------------- > exporting file in ext3 filesystem as block device (1g) > -------------------------------------------------------------------------- > > kvm write (1g dataset): > ---------------------------+-------+-------+-------------- > +------------+ > Test scenarios | bandw | % CPU | ave submit | ave > compl | > block size,iface,cache,sync| MB/s | usage | latency usec | latency > ms | > ---------------------------+-------+-------+-------------- > +------------+ > 16k,virtio,off,none | 12.1 | 15 | 9.1 | > 8.71 | > 16k,virtio,on ,none | 192.0 | 52 | 62.52 | > 6.17 | > 16k,virtio,on ,O_DSYNC | 142.0 | 59 | 18.81 | > 8.29 | > ---------------------------+-------+-------+-------------- > +------------+ > 64k,virtio,off,none | 15.5 | 8 | 21.10 | > 311.00 | > 64k,virtio,on ,none | 454.0 | 130 | 113.25 | > 10.65 | > 64k,virtio,on ,O_DSYNC | 154.0 | 48 | 20.25 | > 30.75 | > ---------------------------+-------+-------+-------------- > +------------+ > 1M ,virtio,off,none | 24.7 | 5 | 41736.22 | > 3020.08 | > 1M ,virtio,on ,none | 485.0 | 100 | 2052.09 | > 149.81 | > 1M ,virtio,on ,O_DSYNC | 161.0 | 42 | 6268.84 | > 453.84 | > ---------------------------+-------+-------+-------------- > +------------+ What file type do you use (qcow2, raw ?). Regards, Laurent ----------------------- Laurent Vivier ---------------------- "The best way to predict the future is to invent it." - Alan Kay [-- Attachment #2: Type: text/html, Size: 13285 bytes --] ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 18:51 ` Laurent Vivier @ 2008-10-13 19:43 ` Ryan Harper 2008-10-13 20:21 ` Laurent Vivier ` (2 more replies) 0 siblings, 3 replies; 101+ messages in thread From: Ryan Harper @ 2008-10-13 19:43 UTC (permalink / raw) To: Laurent Vivier Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper * Laurent Vivier <laurent@lvivier.info> [2008-10-13 13:52]: > > Le 13 oct. 08 à 19:06, Ryan Harper a écrit : > > >* Anthony Liguori <anthony@codemonkey.ws> [2008-10-09 12:00]: > >>Read performance should be unaffected by using O_DSYNC. O_DIRECT > >>will > >>significantly reduce read performance. I think we should use > >>O_DSYNC by > >>default and I have sent out a patch that contains that. We will > >>follow > >>up with benchmarks to demonstrate this. > > > > Hi Ryan, > > as "cache=on" implies a factor (memory) shared by the whole system, > you must take into account the size of the host memory and run some > applications (several guests ?) to pollute the host cache, for > instance you can run 4 guest and run bench in each of them > concurrently, and you could reasonably limits the size of the host > memory to 5 x the size of the guest memory. > (for instance 4 guests with 128 MB on a host with 768 MB). I'm not following you here, the only assumption I see is that we have 1g of host mem free for caching the write. > > as O_DSYNC implies journal commit, you should run a bench on the ext3 > host file system concurrently to the bench on a guest to see the > impact of the commit on each bench. I understand the goal here, but what sort of host ext3 journaling load is appropriate. Additionally, when we're exporting block devices, I don't believe the ext3 journal is an issue. > > > > >baremetal baseline (1g dataset): > >---------------------------+-------+-------+-------------- > >+------------+ > >Test scenarios | bandw | % CPU | ave submit | ave > >compl | > >type, block size, iface | MB/s | usage | latency usec | latency > >ms | > >---------------------------+-------+-------+-------------- > >+------------+ > >write, 16k, lvm, direct=1 | 127.7 | 12 | 11.66 | > >9.48 | > >write, 64k, lvm, direct=1 | 178.4 | 5 | 13.65 | > >27.15 | > >write, 1M, lvm, direct=1 | 186.0 | 3 | 163.75 | > >416.91 | > >---------------------------+-------+-------+-------------- > >+------------+ > >read , 16k, lvm, direct=1 | 170.4 | 15 | 10.86 | > >7.10 | > >read , 64k, lvm, direct=1 | 199.2 | 5 | 12.52 | > >24.31 | > >read , 1M, lvm, direct=1 | 202.0 | 3 | 133.74 | > >382.67 | > >---------------------------+-------+-------+-------------- > >+------------+ > > > > Could you recall which benchmark you use ? 
yeah: fio --name=guestrun --filename=/dev/vda --rw=write --bs=${SIZE} --ioengine=libaio --direct=1 --norandommap --numjobs=1 --group_reporting --thread --size=1g --write_lat_log --write_bw_log --iodepth=74 > > >kvm write (1g dataset): > >---------------------------+-------+-------+-------------- > >+------------+ > >Test scenarios | bandw | % CPU | ave submit | ave > >compl | > >block size,iface,cache,sync| MB/s | usage | latency usec | latency > >ms | > >---------------------------+-------+-------+-------------- > >+------------+ > >16k,virtio,off,none | 135.0 | 94 | 9.1 | > >8.71 | > >16k,virtio,on ,none | 184.0 | 100 | 63.69 | > >63.48 | > >16k,virtio,on ,O_DSYNC | 150.0 | 35 | 6.63 | > >8.31 | > >---------------------------+-------+-------+-------------- > >+------------+ > >64k,virtio,off,none | 169.0 | 51 | 17.10 | > >28.00 | > >64k,virtio,on ,none | 189.0 | 60 | 69.42 | > >24.92 | > >64k,virtio,on ,O_DSYNC | 171.0 | 48 | 18.83 | > >27.72 | > >---------------------------+-------+-------+-------------- > >+------------+ > >1M ,virtio,off,none | 142.0 | 30 | 7176.00 | > >523.00 | > >1M ,virtio,on ,none | 190.0 | 45 | 5332.63 | > >392.35 | > >1M ,virtio,on ,O_DSYNC | 164.0 | 39 | 6444.48 | > >471.20 | > >---------------------------+-------+-------+-------------- > >+------------+ > > According to the semantic, I don't understand how O_DSYNC can be > better than cache=off in this case... I don't have a good answer either, but O_DIRECT and O_DSYNC are different paths through the kernel. This deserves a better reply, but I don't have one off the top of my head. > > > > >kvm read (1g dataset): > >---------------------------+-------+-------+-------------- > >+------------+ > >Test scenarios | bandw | % CPU | ave submit | ave > >compl | > >block size,iface,cache,sync| MB/s | usage | latency usec | latency > >ms | > >---------------------------+-------+-------+-------------- > >+------------+ > >16k,virtio,off,none | 175.0 | 40 | 22.42 | > >6.71 | > >16k,virtio,on ,none | 211.0 | 147 | 59.49 | > >5.54 | > >16k,virtio,on ,O_DSYNC | 212.0 | 145 | 60.45 | > >5.47 | > >---------------------------+-------+-------+-------------- > >+------------+ > >64k,virtio,off,none | 190.0 | 64 | 16.31 | > >24.92 | > >64k,virtio,on ,none | 546.0 | 161 | 111.06 | > >8.54 | > >64k,virtio,on ,O_DSYNC | 520.0 | 151 | 116.66 | > >8.97 | > >---------------------------+-------+-------+-------------- > >+------------+ > >1M ,virtio,off,none | 182.0 | 32 | 5573.44 | > >407.21 | > >1M ,virtio,on ,none | 750.0 | 127 | 1344.65 | > >96.42 | > >1M ,virtio,on ,O_DSYNC | 768.0 | 123 | 1289.05 | > >94.25 | > >---------------------------+-------+-------+-------------- > >+------------+ > > OK, but in this case the size of the cache for "cache=off" is the size > of the guest cache whereas in the other cases the size of the cache is > the size of the guest cache + the size of the host cache, this is not > fair... it isn't supposed to be fair, cache=off is O_DIRECT, we're reading from the device, we *want* to be able to lean on the host cache to read the data, pay once and benefit in other guests if possible. 
> > > > >-------------------------------------------------------------------------- > >exporting file in ext3 filesystem as block device (1g) > >-------------------------------------------------------------------------- > > > >kvm write (1g dataset): > >---------------------------+-------+-------+-------------- > >+------------+ > >Test scenarios | bandw | % CPU | ave submit | ave > >compl | > >block size,iface,cache,sync| MB/s | usage | latency usec | latency > >ms | > >---------------------------+-------+-------+-------------- > >+------------+ > >16k,virtio,off,none | 12.1 | 15 | 9.1 | > >8.71 | > >16k,virtio,on ,none | 192.0 | 52 | 62.52 | > >6.17 | > >16k,virtio,on ,O_DSYNC | 142.0 | 59 | 18.81 | > >8.29 | > >---------------------------+-------+-------+-------------- > >+------------+ > >64k,virtio,off,none | 15.5 | 8 | 21.10 | > >311.00 | > >64k,virtio,on ,none | 454.0 | 130 | 113.25 | > >10.65 | > >64k,virtio,on ,O_DSYNC | 154.0 | 48 | 20.25 | > >30.75 | > >---------------------------+-------+-------+-------------- > >+------------+ > >1M ,virtio,off,none | 24.7 | 5 | 41736.22 | > >3020.08 | > >1M ,virtio,on ,none | 485.0 | 100 | 2052.09 | > >149.81 | > >1M ,virtio,on ,O_DSYNC | 161.0 | 42 | 6268.84 | > >453.84 | > >---------------------------+-------+-------+-------------- > >+------------+ > > What file type do you use (qcow2, raw ?). Raw. -- Ryan Harper Software Engineer; Linux Technology Center IBM Corp., Austin, Tx (512) 838-9253 T/L: 678-9253 ryanh@us.ibm.com ^ permalink raw reply [flat|nested] 101+ messages in thread
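One way to compare the two kernel paths Ryan mentions without KVM in the picture at all is to drive the same volume from the host with dd, which can open its output with either flag. A sketch under the assumption that /dev/vg0/scratch is a disposable logical volume:
    # O_DIRECT write path (the cache=off analogue), 1g of 64k writes
    dd if=/dev/zero of=/dev/vg0/scratch bs=64k count=16384 oflag=direct
    # O_DSYNC write path (page cache used, each write completes on media)
    dd if=/dev/zero of=/dev/vg0/scratch bs=64k count=16384 oflag=dsync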
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 19:43 ` Ryan Harper @ 2008-10-13 20:21 ` Laurent Vivier 2008-10-13 21:05 ` Ryan Harper 2008-10-14 10:05 ` Kevin Wolf 2008-10-14 16:37 ` Avi Kivity 2 siblings, 1 reply; 101+ messages in thread From: Laurent Vivier @ 2008-10-13 20:21 UTC (permalink / raw) To: Ryan Harper; +Cc: Chris Wright, Mark McLoughlin, qemu-devel, Laurent Vivier Le 13 oct. 08 à 21:43, Ryan Harper a écrit : > * Laurent Vivier <laurent@lvivier.info> [2008-10-13 13:52]: >> >> Le 13 oct. 08 à 19:06, Ryan Harper a écrit : >> >>> * Anthony Liguori <anthony@codemonkey.ws> [2008-10-09 12:00]: >>>> Read performance should be unaffected by using O_DSYNC. O_DIRECT >>>> will >>>> significantly reduce read performance. I think we should use >>>> O_DSYNC by >>>> default and I have sent out a patch that contains that. We will >>>> follow >>>> up with benchmarks to demonstrate this. >>> >> >> Hi Ryan, >> >> as "cache=on" implies a factor (memory) shared by the whole system, >> you must take into account the size of the host memory and run some >> applications (several guests ?) to pollute the host cache, for >> instance you can run 4 guest and run bench in each of them >> concurrently, and you could reasonably limits the size of the host >> memory to 5 x the size of the guest memory. >> (for instance 4 guests with 128 MB on a host with 768 MB). > > I'm not following you here, the only assumption I see is that we > have 1g > of host mem free for caching the write. Is this a realistic use case ? > >> >> as O_DSYNC implies journal commit, you should run a bench on the ext3 >> host file system concurrently to the bench on a guest to see the >> impact of the commit on each bench. > > I understand the goal here, but what sort of host ext3 journaling load > is appropriate. Additionally, when we're exporting block devices, I > don't believe the ext3 journal is an issue. Yes, it's a comment for the last test case. I think you can run the same benchmark as you do in the guest. > > >> >>> >>> baremetal baseline (1g dataset): >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> Test scenarios | bandw | % CPU | ave submit | ave >>> compl | >>> type, block size, iface | MB/s | usage | latency usec | latency >>> ms | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> write, 16k, lvm, direct=1 | 127.7 | 12 | 11.66 | >>> 9.48 | >>> write, 64k, lvm, direct=1 | 178.4 | 5 | 13.65 | >>> 27.15 | >>> write, 1M, lvm, direct=1 | 186.0 | 3 | 163.75 | >>> 416.91 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> read , 16k, lvm, direct=1 | 170.4 | 15 | 10.86 | >>> 7.10 | >>> read , 64k, lvm, direct=1 | 199.2 | 5 | 12.52 | >>> 24.31 | >>> read , 1M, lvm, direct=1 | 202.0 | 3 | 133.74 | >>> 382.67 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> >> >> Could you recall which benchmark you use ? > > yeah: > > fio --name=guestrun --filename=/dev/vda --rw=write --bs=${SIZE} > --ioengine=libaio --direct=1 --norandommap --numjobs=1 -- > group_reporting > --thread --size=1g --write_lat_log --write_bw_log --iodepth=74 > Thank you... 
>> >>> kvm write (1g dataset): >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> Test scenarios | bandw | % CPU | ave submit | ave >>> compl | >>> block size,iface,cache,sync| MB/s | usage | latency usec | latency >>> ms | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> 16k,virtio,off,none | 135.0 | 94 | 9.1 | >>> 8.71 | >>> 16k,virtio,on ,none | 184.0 | 100 | 63.69 | >>> 63.48 | >>> 16k,virtio,on ,O_DSYNC | 150.0 | 35 | 6.63 | >>> 8.31 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> 64k,virtio,off,none | 169.0 | 51 | 17.10 | >>> 28.00 | >>> 64k,virtio,on ,none | 189.0 | 60 | 69.42 | >>> 24.92 | >>> 64k,virtio,on ,O_DSYNC | 171.0 | 48 | 18.83 | >>> 27.72 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> 1M ,virtio,off,none | 142.0 | 30 | 7176.00 | >>> 523.00 | >>> 1M ,virtio,on ,none | 190.0 | 45 | 5332.63 | >>> 392.35 | >>> 1M ,virtio,on ,O_DSYNC | 164.0 | 39 | 6444.48 | >>> 471.20 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >> >> According to the semantic, I don't understand how O_DSYNC can be >> better than cache=off in this case... > > I don't have a good answer either, but O_DIRECT and O_DSYNC are > different paths through the kernel. This deserves a better reply, but > I don't have one off the top of my head. The O_DIRECT kernel path should be more "direct" than the O_DSYNC one. Perhaps a oprofile could help to understand ? What it is strange also is the CPU usage with cache=off. It should be lower than others, perhaps an alignment issue ? due to the LVM ? > > >> >>> >>> kvm read (1g dataset): >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> Test scenarios | bandw | % CPU | ave submit | ave >>> compl | >>> block size,iface,cache,sync| MB/s | usage | latency usec | latency >>> ms | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> 16k,virtio,off,none | 175.0 | 40 | 22.42 | >>> 6.71 | >>> 16k,virtio,on ,none | 211.0 | 147 | 59.49 | >>> 5.54 | >>> 16k,virtio,on ,O_DSYNC | 212.0 | 145 | 60.45 | >>> 5.47 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> 64k,virtio,off,none | 190.0 | 64 | 16.31 | >>> 24.92 | >>> 64k,virtio,on ,none | 546.0 | 161 | 111.06 | >>> 8.54 | >>> 64k,virtio,on ,O_DSYNC | 520.0 | 151 | 116.66 | >>> 8.97 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> 1M ,virtio,off,none | 182.0 | 32 | 5573.44 | >>> 407.21 | >>> 1M ,virtio,on ,none | 750.0 | 127 | 1344.65 | >>> 96.42 | >>> 1M ,virtio,on ,O_DSYNC | 768.0 | 123 | 1289.05 | >>> 94.25 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >> >> OK, but in this case the size of the cache for "cache=off" is the >> size >> of the guest cache whereas in the other cases the size of the cache >> is >> the size of the guest cache + the size of the host cache, this is not >> fair... > > it isn't supposed to be fair, cache=off is O_DIRECT, we're reading > from > the device, we *want* to be able to lean on the host cache to read the > data, pay once and benefit in other guests if possible. OK, but if you want to follow this way I think you must run several guests concurrently to see how the host cache help each of them. If you want I can try this tomorrow ? The O_DSYNC patch is the one posted to the mailing-list ? 
And moreover, you should run an endurance test to see how the cache evolves. > >> >>> >>> -------------------------------------------------------------------------- >>> exporting file in ext3 filesystem as block device (1g) >>> -------------------------------------------------------------------------- >>> >>> kvm write (1g dataset): >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> Test scenarios | bandw | % CPU | ave submit | ave >>> compl | >>> block size,iface,cache,sync| MB/s | usage | latency usec | latency >>> ms | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> 16k,virtio,off,none | 12.1 | 15 | 9.1 | >>> 8.71 | >>> 16k,virtio,on ,none | 192.0 | 52 | 62.52 | >>> 6.17 | >>> 16k,virtio,on ,O_DSYNC | 142.0 | 59 | 18.81 | >>> 8.29 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> 64k,virtio,off,none | 15.5 | 8 | 21.10 | >>> 311.00 | >>> 64k,virtio,on ,none | 454.0 | 130 | 113.25 | >>> 10.65 | >>> 64k,virtio,on ,O_DSYNC | 154.0 | 48 | 20.25 | >>> 30.75 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >>> 1M ,virtio,off,none | 24.7 | 5 | 41736.22 | >>> 3020.08 | >>> 1M ,virtio,on ,none | 485.0 | 100 | 2052.09 | >>> 149.81 | >>> 1M ,virtio,on ,O_DSYNC | 161.0 | 42 | 6268.84 | >>> 453.84 | >>> ---------------------------+-------+-------+-------------- >>> +------------+ >> >> What file type do you use (qcow2, raw ?). > > Raw. No comment Laurent ----------------------- Laurent Vivier ---------------------- "The best way to predict the future is to invent it." - Alan Kay ^ permalink raw reply [flat|nested] 101+ messages in thread
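If someone does collect the profile Laurent is asking for, the classic opcontrol workflow would presumably look something like the following; the vmlinux path and binary location are illustrative and depend on the distro:
    opcontrol --vmlinux=/boot/vmlinux-$(uname -r) --start
    # ... run the fio job in the guest with cache=off ...
    opcontrol --stop && opcontrol --dump
    opreport -l $(which qemu-system-x86_64) | head -30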
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 20:21 ` Laurent Vivier @ 2008-10-13 21:05 ` Ryan Harper 2008-10-15 13:10 ` Laurent Vivier 0 siblings, 1 reply; 101+ messages in thread From: Ryan Harper @ 2008-10-13 21:05 UTC (permalink / raw) To: Laurent Vivier Cc: Chris Wright, Mark McLoughlin, Laurent Vivier, qemu-devel, Ryan Harper * Laurent Vivier <laurent@lvivier.info> [2008-10-13 15:39]: > >> > >>as "cache=on" implies a factor (memory) shared by the whole system, > >>you must take into account the size of the host memory and run some > >>applications (several guests ?) to pollute the host cache, for > >>instance you can run 4 guest and run bench in each of them > >>concurrently, and you could reasonably limits the size of the host > >>memory to 5 x the size of the guest memory. > >>(for instance 4 guests with 128 MB on a host with 768 MB). > > > >I'm not following you here, the only assumption I see is that we > >have 1g > >of host mem free for caching the write. > > Is this a realistic use case ? Optimistic? I don't think it is unrealistic. It is hard to know what hardware and use-case any end user may have at their disposal. > >> > >>as O_DSYNC implies journal commit, you should run a bench on the ext3 > >>host file system concurrently to the bench on a guest to see the > >>impact of the commit on each bench. > > > >I understand the goal here, but what sort of host ext3 journaling load > >is appropriate. Additionally, when we're exporting block devices, I > >don't believe the ext3 journal is an issue. > > Yes, it's a comment for the last test case. > I think you can run the same benchmark as you do in the guest. I'm not sure where to go with this. If it turns out that scaling out on to of ext3 stinks, then the deployment needs to change to deal with that limitation in ext3. Use a proper block device, something like lvm. > >>According to the semantic, I don't understand how O_DSYNC can be > >>better than cache=off in this case... > > > >I don't have a good answer either, but O_DIRECT and O_DSYNC are > >different paths through the kernel. This deserves a better reply, but > >I don't have one off the top of my head. > > The O_DIRECT kernel path should be more "direct" than the O_DSYNC one. > Perhaps a oprofile could help to understand ? > What it is strange also is the CPU usage with cache=off. It should be > lower than others, perhaps an alignment issue ? > due to the LVM ? All possible, I don't have an oprofile of it. > >> > >>OK, but in this case the size of the cache for "cache=off" is the > >>size > >>of the guest cache whereas in the other cases the size of the cache > >>is > >>the size of the guest cache + the size of the host cache, this is not > >>fair... > > > >it isn't supposed to be fair, cache=off is O_DIRECT, we're reading > >from > >the device, we *want* to be able to lean on the host cache to read the > >data, pay once and benefit in other guests if possible. > > OK, but if you want to follow this way I think you must run several > guests concurrently to see how the host cache help each of them. > If you want I can try this tomorrow ? The O_DSYNC patch is the one > posted to the mailing-list ? The patch used is the same as what is on the list, feel free to try. > > And moreover, you should run an endurance test to see how the cache > evolves. I'm not sure how interesting this is, either it was in the cache or not, depending on what work you do you can either devolve to a case where nothing is in cache or where everything is in cache. 
The point being that by using the cache where we can, we get the benefit. If you use cache=off you'll never be able to get that boost when it would otherwise have been available. -- Ryan Harper Software Engineer; Linux Technology Center IBM Corp., Austin, Tx (512) 838-9253 T/L: 678-9253 ryanh@us.ibm.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 21:05 ` Ryan Harper @ 2008-10-15 13:10 ` Laurent Vivier 2008-10-16 10:24 ` Laurent Vivier 0 siblings, 1 reply; 101+ messages in thread From: Laurent Vivier @ 2008-10-15 13:10 UTC (permalink / raw) To: qemu-devel; +Cc: Chris Wright, Mark McLoughlin, Ryan Harper Hi, I made some tests on my system. Intel Xeon, 2 GB RAM Disk SATA 80 GB, using 4 GB Partitions my guests are: qemu/x86_64-softmmu/qemu-system-x86_64 -hda ../victory.qcow2 -drive file=/dev/sdc1,if=virtio,cache=on -net nic,model=virtio,macaddress=52:54:00:12:34:71 -net tap -serial stdio -m 512 -nographic qemu/x86_64-softmmu/qemu-system-x86_64 -hda ../valkyrie.qcow2 -drive file=/dev/sdc2,if=virtio,cache=on -net nic,model=virtio,macaddress=52:54:00:12:34:72 -net tap -serial stdio -m 512 -nographic I use the fio command given by Ryan with a 5 GB dataset (bigger than host RAM). Results follow. baremetal | MB/s | avg sub | avg comp | | | lat (us) | lat (ms) | ---------------+-------+----------+----------+ write,16k | 59.86 | 4.29 | 20.25 | write,64k | 59.87 | 7.65 | 80.99 | write,1M | 59.87 | 14935.89 | 1280.71 | ---------------+-------+----------+----------+ read,16k | 59.87 | 3.98 | 20,24 | read,64k | 59.88 | 8.19 | 80.98 | read,1M | 59.85 | 14959.63 | 1280.55 | ---------------+-------+----------+----------+ one guest, cache=on | MB/s | avg sub | avg comp | | | lat (us) | lat (ms) | ---------------+-------+----------+----------+ write,16k | 59.35 | 44.64 | 20.38 | write,64k | 53.40 | 70.87 | 90.72 | write,1M | 54.81 | 18963.69 | 1395.37 | ---------------+-------+----------+----------+ read,16k | 35.62 | 7.84 | 34.02 | read,64k | 34.27 | 11.86 | 141.48 | read,1M | 17.50 | 59689.95 | 4344.10 | ---------------+-------+----------+----------+ one guest, cache=off | MB/s | avg sub | avg comp | | | lat (us) | lat (ms) | ---------------+-------+----------+----------+ write,16k | 59.31 | 4.44 | 20.43 | write,64k | 14.90 | 11.54 | 325.49 | write,1M | 23.37 | 44683.35 | 3255.03 | ---------------+-------+----------+----------+ read,16k | 59.00 | 4.41 | 20.54 | read,64k | 13.04 | 11.84 | 371.80 | read,1M | 17.79 | 58712.11 | 4277.20 | ---------------+-------+----------+----------+ one guest, cache=on, O_DSYNC | MB/s | avg sub | avg comp | | | lat (us) | lat (ms) | ---------------+-------+----------+----------+ write,16k | 54.44 | 13.07 | 22.25 | write,64k | 54.19 | 13.10 | 89.48 | write,1M | 58.77 | 17763.85 | 1295.22 | ---------------+-------+----------+----------+ read,16k | 35.27 | 7.83 | 34.36 | read,64k | 33.59 | 11.74 | 144.36 | read,1M | 17.44 | 59856.18 | 4357.69 | ---------------+-------+----------+----------+ two guests, cache=on | MB/s | avg sub | avg comp | | | lat (us) | lat (ms) | ---------------+-------+----------+----------+ write,16k | 19.20 | 36.83 | 63.11 | | 18.90 | 35.06 | 64.10 | write,64k | 18.22 | 62.46 | 266.09 | | 17.68 | 61.64 | 274.89 | write,1M | 17.18 | 60442.52 | 4454.48 | | 17.11 | 61137.82 | 4424.15 | ---------------+-------+----------+----------+ read,16k | 16.32 | 8.19 | 74.25 | | 20.62 | 7.17 | 58.77 | read,64k | 13.02 | 14.05 | 372.35 | | 13.47 | 14.60 | 359.95 | read,1M | 7.68 |135632.60 | 9909.40 | | 7.62 |137367.63 | 9985.99 | ---------------+-------+----------+----------+ two guests, cache=off | MB/s | avg sub | avg comp | | | lat (us) | lat (ms) | ---------------+-------+----------+----------+ write,16k | 26.39 | 7.08 | 45.58 | | 26.40 | 8.33 | 45.90 | write,64k | 8.08 | 12.77 | 599.79 | | 8.09 | 12.87 | 599.59 | write,1M | 10.27 |101694.60 | 
7410.92 | | 10.28 |101513.20 | 7405.89 | ---------------+-------+----------+----------+ read,16k | 42.36 | 4.60 | 28.60 | | 27.96 | 14.56 | 43.31 | read,64k | 5.84 | 13.31 | 830.94 | | 5.83 | 22.27 | 830.62 | read,1M | 7.82 |133631.63 | 9730.10 | | 7.82 |133351.59 | 9725.79 | ---------------+-------+----------+----------+ two guests, cache=on, O_DSYNC | MB/s | avg sub | avg comp | | | lat (us) | lat (ms) | ---------------+-------+----------+----------+ write,16k | 19.77 | 17.36 | 61.29 | | 19.73 | 6.36 | 61.43 | write,64k | 23.10 | 14.00 | 209.94 | | 36.25 | 14.51 | 25.22 | write,1M | 23.94 | 43704.88 | 3146.77 | | 36.68 | 28456.63 | 2073.53 | ---------------+-------+----------+----------+ read,16k | 16.38 | 8.04 | 73.99 | | 20.08 | 6.88 | 60.38 | read,64k | 11.39 | 15.22 | 425.61 | | 11.50 | 14.97 | 421.55 | read,1M | 7.68 |135693.24 | 9914.71 | | 7.61 |137409.27 | 9984.48 | ---------------+-------+----------+----------+ -- ------------------ Laurent.Vivier@bull.net ------------------ "Tout ce qui est impossible reste à accomplir" Jules Verne "Things are only impossible until they're not" Jean-Luc Picard ^ permalink raw reply [flat|nested] 101+ messages in thread
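One methodological aside on runs like these: the cache=on numbers depend heavily on what is already resident on the host, so it may be worth flushing the host cache between runs, or booting the host with a mem= limit as Laurent does elsewhere in the thread. For example (host-side, as root):
    # drop the host page cache, dentries and inodes between runs
    sync && echo 3 > /proc/sys/vm/drop_caches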
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-15 13:10 ` Laurent Vivier @ 2008-10-16 10:24 ` Laurent Vivier 2008-10-16 13:43 ` Anthony Liguori 0 siblings, 1 reply; 101+ messages in thread From: Laurent Vivier @ 2008-10-16 10:24 UTC (permalink / raw) To: qemu-devel; +Cc: Chris Wright, Mark McLoughlin, Ryan Harper Hi, I've made a benchmark using a database: mysql and sysbench in OLTP mode. cache=off seems to be the best choice in this case... mysql database http://sysbench.sourceforge.net sysbench --test=oltp 200,000 requests on 2,000,000 rows table. | total time | per-request stat (ms) | | (seconds) | min | avg | max | -----------------+------------+-------+-------+-------+ baremetal | 208.6237 | 2.5 | 16.7 | 942.6 | -----------------+------------+-------+-------+-------+ cache=on | 642.2962 | 2.5 | 51.4 | 326.9 | -----------------+------------+-------+-------+-------+ cache=on,O_DSYNC | 646.6570 | 2.7 | 51.7 | 347.0 | -----------------+------------+-------+-------+-------+ cache=off | 635.4424 | 2.9 | 50.8 | 399.5 | -----------------+------------+-------+-------+-------+ Laurent -- ------------------ Laurent.Vivier@bull.net ------------------ "Tout ce qui est impossible reste à accomplir" Jules Verne "Things are only impossible until they're not" Jean-Luc Picard ^ permalink raw reply [flat|nested] 101+ messages in thread
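For anyone trying to reproduce this, the sysbench side presumably amounts to something like the following; the option names are from sysbench 0.4, and the database name and credentials are placeholders:
    sysbench --test=oltp --mysql-db=sbtest --mysql-user=root \
             --oltp-table-size=2000000 prepare
    sysbench --test=oltp --mysql-db=sbtest --mysql-user=root \
             --oltp-table-size=2000000 --max-requests=200000 run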
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-16 10:24 ` Laurent Vivier @ 2008-10-16 13:43 ` Anthony Liguori 2008-10-16 16:08 ` Laurent Vivier 2008-10-17 12:48 ` Avi Kivity 0 siblings, 2 replies; 101+ messages in thread From: Anthony Liguori @ 2008-10-16 13:43 UTC (permalink / raw) To: qemu-devel; +Cc: Chris Wright, Mark McLoughlin, Ryan Harper Laurent Vivier wrote: > Hi, > > I've made a benchmark using a database: > mysql and sysbench in OLTP mode. > > cache=off seems to be the best choice in this case... > It would be interesting for you to run the same workload under KVM. > mysql database > http://sysbench.sourceforge.net > > sysbench --test=oltp > > 200,000 requests on 2,000,000 rows table. > > | total time | per-request stat (ms) | > | (seconds) | min | avg | max | > -----------------+------------+-------+-------+-------+ > baremetal | 208.6237 | 2.5 | 16.7 | 942.6 | > -----------------+------------+-------+-------+-------+ > cache=on | 642.2962 | 2.5 | 51.4 | 326.9 | > -----------------+------------+-------+-------+-------+ > cache=on,O_DSYNC | 646.6570 | 2.7 | 51.7 | 347.0 | > -----------------+------------+-------+-------+-------+ > cache=off | 635.4424 | 2.9 | 50.8 | 399.5 | > -----------------+------------+-------+-------+-------+ > Because you're talking about roughly 1/3 of native performance. This means that you may be dominated by things like CPU overhead versus actual IO throughput. Regards, Anthony Liguori > Laurent > ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-16 13:43 ` Anthony Liguori @ 2008-10-16 16:08 ` Laurent Vivier 2008-10-17 12:48 ` Avi Kivity 1 sibling, 0 replies; 101+ messages in thread From: Laurent Vivier @ 2008-10-16 16:08 UTC (permalink / raw) To: qemu-devel; +Cc: Chris Wright, Mark McLoughlin, Ryan Harper Le jeudi 16 octobre 2008 à 08:43 -0500, Anthony Liguori a écrit : > Laurent Vivier wrote: > > Hi, > > > > I've made a benchmark using a database: > > mysql and sysbench in OLTP mode. > > > > cache=off seems to be the best choice in this case... > > > > It would be interesting for you to run the same workload under KVM. It is done under KVM... and I've just double checked these values. > > mysql database > > http://sysbench.sourceforge.net > > > > sysbench --test=oltp > > > > 200,000 requests on 2,000,000 rows table. > > > > | total time | per-request stat (ms) | > > | (seconds) | min | avg | max | > > -----------------+------------+-------+-------+-------+ > > baremetal | 208.6237 | 2.5 | 16.7 | 942.6 | > > -----------------+------------+-------+-------+-------+ > > cache=on | 642.2962 | 2.5 | 51.4 | 326.9 | > > -----------------+------------+-------+-------+-------+ > > cache=on,O_DSYNC | 646.6570 | 2.7 | 51.7 | 347.0 | > > -----------------+------------+-------+-------+-------+ > > cache=off | 635.4424 | 2.9 | 50.8 | 399.5 | > > -----------------+------------+-------+-------+-------+ > > > > Because you're talking about 1/3% of native performance. This means > that you may be dominated by things like CPU overhead verses actual IO > throughput. Yes, but as it is KVM I have no explanation... I've another interesting result with scsi-generic : -----------------+------------+-------+-------+-------+ scsi-generic | 634.1303 | 2.8 | 50.7 | 308.6 | -----------------+------------+-------+-------+-------+ Regards, Laurent -- ------------------ Laurent.Vivier@bull.net ------------------ "Tout ce qui est impossible reste à accomplir" Jules Verne "Things are only impossible until they're not" Jean-Luc Picard ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-16 13:43 ` Anthony Liguori 2008-10-16 16:08 ` Laurent Vivier @ 2008-10-17 12:48 ` Avi Kivity 2008-10-17 13:17 ` Laurent Vivier 1 sibling, 1 reply; 101+ messages in thread From: Avi Kivity @ 2008-10-17 12:48 UTC (permalink / raw) To: qemu-devel; +Cc: Chris Wright, Mark McLoughlin, Ryan Harper Anthony Liguori wrote: >> >> | total time | per-request stat (ms) | >> | (seconds) | min | avg | max | >> -----------------+------------+-------+-------+-------+ >> baremetal | 208.6237 | 2.5 | 16.7 | 942.6 | >> -----------------+------------+-------+-------+-------+ >> cache=on | 642.2962 | 2.5 | 51.4 | 326.9 | >> -----------------+------------+-------+-------+-------+ >> cache=on,O_DSYNC | 646.6570 | 2.7 | 51.7 | 347.0 | >> -----------------+------------+-------+-------+-------+ >> cache=off | 635.4424 | 2.9 | 50.8 | 399.5 | >> -----------------+------------+-------+-------+-------+ >> > > Because you're talking about 1/3% of native performance. This means > that you may be dominated by things like CPU overhead verses actual IO > throughput. I don't know mysql well, but perhaps it sizes its internal cache to system memory size, so baremetal has 4x the amount of cache. If mysql uses mmap to access its data files, then it automatically scales with system memory. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-17 12:48 ` Avi Kivity @ 2008-10-17 13:17 ` Laurent Vivier 0 siblings, 0 replies; 101+ messages in thread From: Laurent Vivier @ 2008-10-17 13:17 UTC (permalink / raw) To: qemu-devel; +Cc: Chris Wright, Mark McLoughlin, Ryan Harper Le vendredi 17 octobre 2008 à 14:48 +0200, Avi Kivity a écrit : > Anthony Liguori wrote: > >> > >> | total time | per-request stat (ms) | > >> | (seconds) | min | avg | max | > >> -----------------+------------+-------+-------+-------+ > >> baremetal | 208.6237 | 2.5 | 16.7 | 942.6 | > >> -----------------+------------+-------+-------+-------+ > >> cache=on | 642.2962 | 2.5 | 51.4 | 326.9 | > >> -----------------+------------+-------+-------+-------+ > >> cache=on,O_DSYNC | 646.6570 | 2.7 | 51.7 | 347.0 | > >> -----------------+------------+-------+-------+-------+ > >> cache=off | 635.4424 | 2.9 | 50.8 | 399.5 | > >> -----------------+------------+-------+-------+-------+ > >> > > > > Because you're talking about 1/3% of native performance. This means > > that you may be dominated by things like CPU overhead verses actual IO > > throughput. > > I don't know mysql well, but perhaps it sizes its internal cache to > system memory size, so baremetal has 4x the amount of cache. > > If mysql uses mmap to access its data files, then it automatically > scales with system memory. It is what I thought but no: I've approximately the same results with "mem=512M". Regards, Laurent -- ------------------ Laurent.Vivier@bull.net ------------------ "Tout ce qui est impossible reste à accomplir" Jules Verne "Things are only impossible until they're not" Jean-Luc Picard ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 19:43 ` Ryan Harper 2008-10-13 20:21 ` Laurent Vivier @ 2008-10-14 10:05 ` Kevin Wolf 2008-10-14 14:32 ` Ryan Harper 2008-10-14 16:37 ` Avi Kivity 2 siblings, 1 reply; 101+ messages in thread From: Kevin Wolf @ 2008-10-14 10:05 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper, Laurent Vivier Ryan Harper schrieb: > * Laurent Vivier <laurent@lvivier.info> [2008-10-13 13:52]: >> What file type do you use (qcow2, raw ?). > > Raw. I guess the image is preallocated? What about sparse files (or qcow2, anything that grows), do you have numbers on those? In the past, I experienced O_DIRECT to be horribly slow on them. Well, looking at your numbers, they _are_ quite bad, so maybe it actually was sparse. Then the preallocated case would be interesting. Kevin ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-14 10:05 ` Kevin Wolf @ 2008-10-14 14:32 ` Ryan Harper 0 siblings, 0 replies; 101+ messages in thread From: Ryan Harper @ 2008-10-14 14:32 UTC (permalink / raw) To: Kevin Wolf Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper, Laurent Vivier * Kevin Wolf <kwolf@suse.de> [2008-10-14 05:10]: > Ryan Harper schrieb: > > * Laurent Vivier <laurent@lvivier.info> [2008-10-13 13:52]: > >> What file type do you use (qcow2, raw ?). > > > > Raw. > > I guess the image is preallocated? What about sparse files (or qcow2, > anything that grows), do you have numbers on those? In the past, I > experienced O_DIRECT to be horribly slow on them. > > Well, looking at your numbers, they _are_ quite bad, so maybe it > actually was sparse. Then the preallocated case would be interesting. It was pre-allocated. I'm inclined to think there is an alignment issue or some sort of bug/edge-case in the write path to the file on top of the lvm volume, considering I don't see such horrible performance against the file on the host via O_DIRECT. I imagine until I figure out the issue, sparse or preallocated will perform the same. -- Ryan Harper Software Engineer; Linux Technology Center IBM Corp., Austin, Tx (512) 838-9253 T/L: 678-9253 ryanh@us.ibm.com ^ permalink raw reply [flat|nested] 101+ messages in thread
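For anyone wanting to compare the sparse and preallocated cases Kevin asks about, a minimal way to create both on the host (file names are illustrative, 1G to match the dataset):
    # sparse raw image: blocks are allocated only on first write
    qemu-img create -f raw disk-sparse.raw 1G
    # fully preallocated raw image
    dd if=/dev/zero of=disk-prealloc.raw bs=1M count=1024
    # compare apparent size against blocks actually allocated
    ls -ls disk-sparse.raw disk-prealloc.raw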
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 19:43 ` Ryan Harper 2008-10-13 20:21 ` Laurent Vivier 2008-10-14 10:05 ` Kevin Wolf @ 2008-10-14 16:37 ` Avi Kivity 2 siblings, 0 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-14 16:37 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper, Laurent Vivier Ryan Harper wrote: > fio --name=guestrun --filename=/dev/vda --rw=write --bs=${SIZE} > --ioengine=libaio --direct=1 --norandommap --numjobs=1 --group_reporting > --thread --size=1g --write_lat_log --write_bw_log --iodepth=74 > > How large is /dev/vda? Also, I think you're doing sequential access, which means sequential runs will improve as data is brought into cache. I suggest random access, with a very large /dev/vda. >> OK, but in this case the size of the cache for "cache=off" is the size >> of the guest cache whereas in the other cases the size of the cache is >> the size of the guest cache + the size of the host cache, this is not >> fair... >> > > it isn't supposed to be fair, cache=off is O_DIRECT, we're reading from > the device, we *want* to be able to lean on the host cache to read the > data, pay once and benefit in other guests if possible. > > My assumption is that the memory would be better utilized in the guest (which makes better eviction choices, and which is a lot closer to the application). We'd need to run fio in non-direct mode to show this. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
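Building on the fio line quoted above, the variants Avi suggests would presumably look roughly like this; the 40g size is a stand-in for "much larger than host RAM" and assumes /dev/vda is at least that big:
    # random reads over a large device, still O_DIRECT in the guest
    fio --name=randread --filename=/dev/vda --rw=randread --bs=16k \
        --ioengine=libaio --direct=1 --norandommap --numjobs=1 \
        --group_reporting --thread --size=40g --iodepth=74
    # the same job through the guest page cache (non-direct)
    fio --name=cachedread --filename=/dev/vda --rw=randread --bs=16k \
        --ioengine=libaio --direct=0 --norandommap --numjobs=1 \
        --group_reporting --thread --size=40g --iodepth=74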
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 17:06 ` [Qemu-devel] " Ryan Harper 2008-10-13 18:43 ` Anthony Liguori 2008-10-13 18:51 ` Laurent Vivier @ 2008-10-13 19:00 ` Mark Wagner 2008-10-13 19:15 ` Ryan Harper 2 siblings, 1 reply; 101+ messages in thread From: Mark Wagner @ 2008-10-13 19:00 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper Ryan Harper wrote: > * Anthony Liguori <anthony@codemonkey.ws> [2008-10-09 12:00]: > >> Read performance should be unaffected by using O_DSYNC. O_DIRECT will >> significantly reduce read performance. I think we should use O_DSYNC by >> default and I have sent out a patch that contains that. We will follow >> up with benchmarks to demonstrate this. >> > > > baremetal baseline (1g dataset): > ---------------------------+-------+-------+--------------+------------+ > Test scenarios | bandw | % CPU | ave submit | ave compl | > type, block size, iface | MB/s | usage | latency usec | latency ms | > ---------------------------+-------+-------+--------------+------------+ > write, 16k, lvm, direct=1 | 127.7 | 12 | 11.66 | 9.48 | > write, 64k, lvm, direct=1 | 178.4 | 5 | 13.65 | 27.15 | > write, 1M, lvm, direct=1 | 186.0 | 3 | 163.75 | 416.91 | > ---------------------------+-------+-------+--------------+------------+ > read , 16k, lvm, direct=1 | 170.4 | 15 | 10.86 | 7.10 | > read , 64k, lvm, direct=1 | 199.2 | 5 | 12.52 | 24.31 | > read , 1M, lvm, direct=1 | 202.0 | 3 | 133.74 | 382.67 | > ---------------------------+-------+-------+--------------+------------+ > > kvm write (1g dataset): > ---------------------------+-------+-------+--------------+------------+ > Test scenarios | bandw | % CPU | ave submit | ave compl | > block size,iface,cache,sync| MB/s | usage | latency usec | latency ms | > ---------------------------+-------+-------+--------------+------------+ > 16k,virtio,off,none | 135.0 | 94 | 9.1 | 8.71 | > 16k,virtio,on ,none | 184.0 | 100 | 63.69 | 63.48 | > 16k,virtio,on ,O_DSYNC | 150.0 | 35 | 6.63 | 8.31 | > ---------------------------+-------+-------+--------------+------------+ > 64k,virtio,off,none | 169.0 | 51 | 17.10 | 28.00 | > 64k,virtio,on ,none | 189.0 | 60 | 69.42 | 24.92 | > 64k,virtio,on ,O_DSYNC | 171.0 | 48 | 18.83 | 27.72 | > ---------------------------+-------+-------+--------------+------------+ > 1M ,virtio,off,none | 142.0 | 30 | 7176.00 | 523.00 | > 1M ,virtio,on ,none | 190.0 | 45 | 5332.63 | 392.35 | > 1M ,virtio,on ,O_DSYNC | 164.0 | 39 | 6444.48 | 471.20 | > ---------------------------+-------+-------+--------------+------------+ > > kvm read (1g dataset): > ---------------------------+-------+-------+--------------+------------+ > Test scenarios | bandw | % CPU | ave submit | ave compl | > block size,iface,cache,sync| MB/s | usage | latency usec | latency ms | > ---------------------------+-------+-------+--------------+------------+ > 16k,virtio,off,none | 175.0 | 40 | 22.42 | 6.71 | > 16k,virtio,on ,none | 211.0 | 147 | 59.49 | 5.54 | > 16k,virtio,on ,O_DSYNC | 212.0 | 145 | 60.45 | 5.47 | > ---------------------------+-------+-------+--------------+------------+ > 64k,virtio,off,none | 190.0 | 64 | 16.31 | 24.92 | > 64k,virtio,on ,none | 546.0 | 161 | 111.06 | 8.54 | > 64k,virtio,on ,O_DSYNC | 520.0 | 151 | 116.66 | 8.97 | > ---------------------------+-------+-------+--------------+------------+ > 1M ,virtio,off,none | 182.0 | 32 | 5573.44 | 407.21 | > 1M ,virtio,on ,none | 750.0 | 127 | 1344.65 | 96.42 | > 1M ,virtio,on ,O_DSYNC | 
768.0 | 123 | 1289.05 | 94.25 | > ---------------------------+-------+-------+--------------+------------+ > > -------------------------------------------------------------------------- > exporting file in ext3 filesystem as block device (1g) > -------------------------------------------------------------------------- > > kvm write (1g dataset): > ---------------------------+-------+-------+--------------+------------+ > Test scenarios | bandw | % CPU | ave submit | ave compl | > block size,iface,cache,sync| MB/s | usage | latency usec | latency ms | > ---------------------------+-------+-------+--------------+------------+ > 16k,virtio,off,none | 12.1 | 15 | 9.1 | 8.71 | > 16k,virtio,on ,none | 192.0 | 52 | 62.52 | 6.17 | > 16k,virtio,on ,O_DSYNC | 142.0 | 59 | 18.81 | 8.29 | > ---------------------------+-------+-------+--------------+------------+ > 64k,virtio,off,none | 15.5 | 8 | 21.10 | 311.00 | > 64k,virtio,on ,none | 454.0 | 130 | 113.25 | 10.65 | > 64k,virtio,on ,O_DSYNC | 154.0 | 48 | 20.25 | 30.75 | > ---------------------------+-------+-------+--------------+------------+ > 1M ,virtio,off,none | 24.7 | 5 | 41736.22 | 3020.08 | > 1M ,virtio,on ,none | 485.0 | 100 | 2052.09 | 149.81 | > 1M ,virtio,on ,O_DSYNC | 161.0 | 42 | 6268.84 | 453.84 | > ---------------------------+-------+-------+--------------+------------+ > > > -- > Ryan Harper > Software Engineer; Linux Technology Center > IBM Corp., Austin, Tx > (512) 838-9253 T/L: 678-9253 > ryanh@us.ibm.com > > > Ryan Can you please post the details of the guest and host configurations. From seeing kvm write data that is greater than that of bare metal, I would think that your test dataset is too small and not exceeding that of the host cache size. Our previous testing has shown that once you exceed the host cache and cause the cache to flush, performance will drop to a point lower than if you didn't use the cache in the first place. Can you repeat the tests using a data set that is 2X the size of your hosts memory and post the results for the community to see? -mark ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 19:00 ` Mark Wagner @ 2008-10-13 19:15 ` Ryan Harper 2008-10-14 16:49 ` Avi Kivity 0 siblings, 1 reply; 101+ messages in thread From: Ryan Harper @ 2008-10-13 19:15 UTC (permalink / raw) To: Mark Wagner Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, qemu-devel, Ryan Harper * Mark Wagner <mwagner@redhat.com> [2008-10-13 14:06]: > Ryan Harper wrote: > > Can you please post the details of the guest and host configurations. http://lists.gnu.org/archive/html/qemu-devel/2008-09/msg01115.html > From seeing kvm write data that is greater than that of bare metal, > I would think that your test dataset is too small and not > exceeding that of the host cache size. The size was chosen so it would fit in the host cache, to demonstrate the crazy #'s seen on cached writes without O_DSYNC. > > Our previous testing has shown that once you exceed the host cache > and cause the cache to flush, performance will drop to a point lower > than if you didn't use the cache in the first place. > > Can you repeat the tests using a data set that is 2X the size of your > hosts memory and post the results for the community to see? Yeah, I can generate those numbers as well. Seeing your note about tons of ESA and storage, feel free to generate your own #'s and post them for the community as well; the more the merrier. -- Ryan Harper Software Engineer; Linux Technology Center IBM Corp., Austin, Tx (512) 838-9253 T/L: 678-9253 ryanh@us.ibm.com ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] Re: [RFC] Disk integrity in QEMU 2008-10-13 19:15 ` Ryan Harper @ 2008-10-14 16:49 ` Avi Kivity 0 siblings, 0 replies; 101+ messages in thread From: Avi Kivity @ 2008-10-14 16:49 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier, Ryan Harper, Mark Wagner Ryan Harper wrote: >> From seeing kvm write data that is greater than that of bare metal, >> I would think that your test dataset is too small and not >> exceeding that of the host cache size. >> > > The size was chosen so it would fit in to demonstrate the crazy #'s seen > on cached writes without O_DSYNC. > > A disk that is smaller than host memory is hardly interesting. Give the memory to the guest and performance will jump to memory speed rather than disk speed. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-09 17:00 [Qemu-devel] [RFC] Disk integrity in QEMU Anthony Liguori ` (5 preceding siblings ...) 2008-10-13 17:06 ` [Qemu-devel] " Ryan Harper @ 2008-10-13 17:58 ` Rik van Riel 2008-10-13 18:22 ` Jamie Lokier 2008-10-28 17:34 ` Ian Jackson 7 siblings, 1 reply; 101+ messages in thread From: Rik van Riel @ 2008-10-13 17:58 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Anthony Liguori wrote: > When cache=on, read requests may not actually go to the disk. If a > previous read request (by some application on the system) has read the > same data, then it becomes a simple memcpy(). Also, the host IO > scheduler may do read ahead which means that the data may be available > from that. This can be as much of a data integrity problem as asynchronous writes, if various qemu/kvm guests are accessing the same disk image with a cluster filesystem like GFS. -- All rights reversed. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 17:58 ` [Qemu-devel] " Rik van Riel @ 2008-10-13 18:22 ` Jamie Lokier 2008-10-13 18:34 ` Rik van Riel 0 siblings, 1 reply; 101+ messages in thread From: Jamie Lokier @ 2008-10-13 18:22 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Rik van Riel wrote: > >When cache=on, read requests may not actually go to the disk. If a > >previous read request (by some application on the system) has read the > >same data, then it becomes a simple memcpy(). Also, the host IO > >scheduler may do read ahead which means that the data may be available > >from that. > > This can be as much of a data integrity problem as > asynchronous writes, if various qemu/kvm guests are > accessing the same disk image with a cluster filesystem > like GFS. If there are multiple qemu/kvm guests accessing the same disk image in a cluster, provided the host cluster filesystem uses a fully coherent protocol, ordinary cached reads should be fine. (E.g. not NFS). The behaviour should be equivalent to a "virtual SAN". (Btw, some other OSes have an O_RSYNC flag to force reads to hit the media, much as O_DSYNC forces writes to. That might be relevant to accessing a disk image file on non-coherent cluster filesystems, but I wouldn't recommend that.) -- Jamie ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 18:22 ` Jamie Lokier @ 2008-10-13 18:34 ` Rik van Riel 2008-10-14 1:56 ` Jamie Lokier 0 siblings, 1 reply; 101+ messages in thread From: Rik van Riel @ 2008-10-13 18:34 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Jamie Lokier wrote: > Rik van Riel wrote: >>> When cache=on, read requests may not actually go to the disk. If a >>> previous read request (by some application on the system) has read the >>> same data, then it becomes a simple memcpy(). Also, the host IO >>> scheduler may do read ahead which means that the data may be available >> >from that. >> >> This can be as much of a data integrity problem as >> asynchronous writes, if various qemu/kvm guests are >> accessing the same disk image with a cluster filesystem >> like GFS. > > If there are multiple qemu/kvm guests accessing the same disk image in > a cluster, provided the host cluster filesystem uses a fully coherent > protocol, ordinary cached reads should be fine. (E.g. not NFS). The problem is when the synchronization only happens in the guests, which is a legitimate and common configuration. Ie. the hosts just pass through the IO and the guests run a GFS cluster. Caching either reads or writes at the host level causes problems. -- All rights reversed. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-13 18:34 ` Rik van Riel @ 2008-10-14 1:56 ` Jamie Lokier 2008-10-14 2:28 ` nuitari-qemu 0 siblings, 1 reply; 101+ messages in thread From: Jamie Lokier @ 2008-10-14 1:56 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Rik van Riel wrote: > >If there are multiple qemu/kvm guests accessing the same disk image in > >a cluster, provided the host cluster filesystem uses a fully coherent > >protocol, ordinary cached reads should be fine. (E.g. not NFS). > > The problem is when the synchronization only happens in the guests, > which is a legitimate and common configuration. > > Ie. the hosts just pass through the IO and the guests run a GFS > cluster. Ok, if you are using multiple hosts with a non-coherent host filesystem for the virtual disk, or a non-coherent host block device for the virtual disk, it won't work. But why would you do that? What is the legitimate and common configuration where you'd share a virtual disk among multiple _hosts_ with a non-coherent host file/device sharing protocol and expect it to work? Do you envisage qemu/kvm using O_DIRECT over NFS or SMB on the host, or something like that? > Caching either reads or writes at the host level causes problems. But only if the hosts are using a non-coherent protocol. Not having a visible effect (except timing) is pretty much the definition of coherent caching. Is there a reason why you wouldn't use, say, GFS on the host (because it claims to be coherent)? Does performance suck relative to O_DIRECT over NFS? -- Jamie ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-14 1:56 ` Jamie Lokier @ 2008-10-14 2:28 ` nuitari-qemu 0 siblings, 0 replies; 101+ messages in thread From: nuitari-qemu @ 2008-10-14 2:28 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel > Is there a reason why you wouldn't use, say, GFS on the host (because > it claims to be coherent)? Does performance suck relative to O_DIRECT > over NFS? Complexity? To set up GFS2 you have to build a full cluster: get it working, make sure that locking works, that quorum is achieved, and that failover and fencing work properly. Plus you then have to maintain all of that. Then you find out that GFS2 is not ready for production (deadlocks) and that GFS is too old to be supported by a recent kernel. OCFS isn't easier either. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-09 17:00 [Qemu-devel] [RFC] Disk integrity in QEMU Anthony Liguori ` (6 preceding siblings ...) 2008-10-13 17:58 ` [Qemu-devel] " Rik van Riel @ 2008-10-28 17:34 ` Ian Jackson 2008-10-28 17:45 ` Anthony Liguori 7 siblings, 1 reply; 101+ messages in thread From: Ian Jackson @ 2008-10-28 17:34 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Anthony Liguori writes ("[Qemu-devel] [RFC] Disk integrity in QEMU"): > So to summarize, I think we should enable O_DSYNC by default to ensure > that guest data integrity is not dependent on the host OS, and that > practically speaking, cache=off is only useful for very specialized > circumstances. Part of the patch I'll follow up with includes changes > to the man page to document all of this for users. I have a patch which does this and allows the host to control the buffering with the IDE cache control facility. I'll be resubmitting it shortly (if I manage to get round to it before going away for three weeks on Thursday lunchtime ...) Ian. ^ permalink raw reply [flat|nested] 101+ messages in thread
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-28 17:34 ` Ian Jackson @ 2008-10-28 17:45 ` Anthony Liguori 2008-10-28 17:50 ` Ian Jackson 0 siblings, 1 reply; 101+ messages in thread From: Anthony Liguori @ 2008-10-28 17:45 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Ian Jackson wrote: > Anthony Liguori writes ("[Qemu-devel] [RFC] Disk integrity in QEMU"): > >> So to summarize, I think we should enable O_DSYNC by default to ensure >> that guest data integrity is not dependent on the host OS, and that >> practically speaking, cache=off is only useful for very specialized >> circumstances. Part of the patch I'll follow up with includes changes >> to the man page to document all of this for users. >> > > I have a patch which does this and allows the host to control the > buffering with the IDE cache control facility. > Do you mean that the guest can control host disk cachability? We've switched to always use O_DSYNC by default. There was a very long thread about it including benchmarks. With the right posix-aio tuning, we can use O_DSYNC without hurting performance*. * Write performance drops but only because write performance was greater than native before. It now is at native performance. Regards, Anthony Liguori > I'll be resubmitting it shortly (if I manage to get round to it before > going away for three weeks on Thursday lunchtime ...) > > Ian. > > > ^ permalink raw reply [flat|nested] 101+ messages in thread
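A rough sketch of the idea with standard POSIX AIO (write_sector_sync_completion is a made-up helper; QEMU's actual backend is more involved): because the descriptor is opened with O_DSYNC, the asynchronous write only completes once the data is on stable storage, so completion can be signalled to the guest without trusting the host to stay up. Link with -lrt on most Linux systems.

  #include <aio.h>
  #include <string.h>
  #include <stdio.h>
  #include <sys/types.h>

  /* fd is assumed to have been opened with O_RDWR | O_DSYNC. */
  static int write_sector_sync_completion(int fd, const void *buf,
                                          size_t len, off_t offset)
  {
      struct aiocb cb;
      memset(&cb, 0, sizeof(cb));
      cb.aio_fildes = fd;
      cb.aio_buf    = (void *)buf;
      cb.aio_nbytes = len;
      cb.aio_offset = offset;

      if (aio_write(&cb) < 0) {
          perror("aio_write");
          return -1;
      }

      /* A real emulator would use completion notification; for this
       * sketch we simply wait for the request to finish. */
      const struct aiocb *list[1] = { &cb };
      aio_suspend(list, 1, NULL);
      return aio_error(&cb) == 0 ? (int)aio_return(&cb) : -1;
  }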
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-28 17:45 ` Anthony Liguori @ 2008-10-28 17:50 ` Ian Jackson 2008-10-28 18:19 ` Jamie Lokier 0 siblings, 1 reply; 101+ messages in thread From: Ian Jackson @ 2008-10-28 17:50 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, Laurent Vivier, kvm-devel Anthony Liguori writes ("Re: [Qemu-devel] [RFC] Disk integrity in QEMU"): > Do you mean that the guest can control host disk cachability? Yes. > We've switched to always use O_DSYNC by default. There was a very > long thread about it including benchmarks. With the right posix-aio > tuning, we can use O_DSYNC without hurting performance*. Right. With the change in my tree, the guest can turn on the use of the host's buffer cache for writes (ie, turn off the use of O_DSYNC), using the appropriate cache control features in the IDE controller (and have write barriers with the FLUSH CACHE command). But this patch will need to be reworked into a coherent state for resubmission because of the upstream changes you mention. Ian. ^ permalink raw reply [flat|nested] 101+ messages in thread
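A hypothetical sketch of how such guest control could be modelled (struct ide_disk and ide_set_features_wcache are invented for illustration; this is not Ian's actual patch): the guest's ATA SET FEATURES write-cache subcommands toggle the image between write-back and write-through behaviour, draining any buffered data when the cache is turned off.

  #include <stdbool.h>
  #include <unistd.h>

  /* ATA SET FEATURES subcommands for the drive write cache
   * (0x02 = enable, 0x82 = disable). */
  #define ATA_WCACHE_ENABLE   0x02
  #define ATA_WCACHE_DISABLE  0x82

  struct ide_disk {
      int  image_fd;
      bool host_writeback;   /* false: treat every write as write-through */
  };

  static void ide_set_features_wcache(struct ide_disk *d, unsigned char sub)
  {
      switch (sub) {
      case ATA_WCACHE_ENABLE:
          d->host_writeback = true;     /* guest asked for a write cache */
          break;
      case ATA_WCACHE_DISABLE:
          fdatasync(d->image_fd);       /* drain, then behave write-through */
          d->host_writeback = false;
          break;
      default:
          break;                        /* other subfeatures not modelled here */
      }
  }

The write path would then call fdatasync() after each completed write while host_writeback is false, which approximates O_DSYNC semantics without having to reopen the image file.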
* Re: [Qemu-devel] [RFC] Disk integrity in QEMU 2008-10-28 17:50 ` Ian Jackson @ 2008-10-28 18:19 ` Jamie Lokier 0 siblings, 0 replies; 101+ messages in thread From: Jamie Lokier @ 2008-10-28 18:19 UTC (permalink / raw) To: qemu-devel Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel, Laurent Vivier Ian Jackson wrote: > > We've switched to always use O_DSYNC by default. There was a very > > long thread about it including benchmarks. With the right posix-aio > > tuning, we can use O_DSYNC without hurting performance*. > > Right. > > With the change in my tree, the guest can turn on the use of the > host's buffer cache for writes (ie, turn off the use of O_DSYNC), > using the appropriate cache control features in the IDE controller > (and have write barriers with the FLUSH CACHE command). I think this is a good idea in principle, but it needs to be overridable by command line and monitor controls. There are a number of guests and usages where you'd want to override it. These come to mind: - Enable host write caching even though the guest turns off IDE caching, because you're testing something and speed is more important than what the guest requests, and you don't want to or can't change the guest. - Disable host write caching even though the guest turns on IDE caching, because you know the guest enables the IDE cache for speed and does not flush the IDE cache for integrity (e.g. some old Linux or Windows?), and you don't want to or can't change the guest. - Disable host read and write caching with O_DIRECT, even though the guest turns on IDE caching, because you want to emulate (roughly) a real disk's performance characteristics. - Disable host read and write caching with O_DIRECT because you don't have spare RAM after the guests have used it. Note that O_DIRECT is not strictly "less caching" than O_DSYNC. Guest IDE FLUSH CACHE commands become host fsync/fdatasync calls. On some Linux hosts, O_DSYNC + fsync will result in a _host_ IDE FLUSH CACHE, when O_DIRECT + fsync will not. -- Jamie ^ permalink raw reply [flat|nested] 101+ messages in thread
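To make the distinctions above concrete, a small sketch with invented names (cache_mode_flags and guest_flush_cache are not QEMU's real code): each caching policy maps to different open(2) flags, and a guest IDE FLUSH CACHE command maps to fdatasync() on the image file.

  #define _GNU_SOURCE             /* for O_DIRECT on Linux */
  #include <fcntl.h>
  #include <unistd.h>

  enum cache_mode { CACHE_WRITEBACK, CACHE_WRITETHROUGH, CACHE_NONE };

  static int cache_mode_flags(enum cache_mode mode)
  {
      switch (mode) {
      case CACHE_WRITETHROUGH:
          return O_RDWR | O_DSYNC;    /* host page cache, but synchronous writes */
      case CACHE_NONE:
          return O_RDWR | O_DIRECT;   /* bypass the host page cache entirely;
                                         buffers must also be aligned (not shown) */
      case CACHE_WRITEBACK:
      default:
          return O_RDWR;              /* rely on the host staying up */
      }
  }

  /* Guest IDE FLUSH CACHE: push host-buffered data for this image to
   * stable storage before acknowledging the command. */
  static int guest_flush_cache(int image_fd)
  {
      return fdatasync(image_fd);
  }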