* Re: [Qemu-devel] [RFC] Disk integrity in QEMU
       [not found]       ` <48F14814.7000805@redhat.com>
@ 2008-10-12  1:50         ` Chris Wright
  2008-10-12 16:22           ` Jamie Lokier
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Wright @ 2008-10-12  1:50 UTC (permalink / raw)
  To: Mark Wagner
  Cc: qemu-devel, Chris Wright, Mark McLoughlin, Ryan Harper,
	Laurent Vivier, kvm

* Mark Wagner (mwagner@redhat.com) wrote:
> I think that there are two distinct arguments going on here. My main concern is
> that I don't think that this is a simple "what do we make the default cache policy
> be" issue. I think that regardless of the cache policy, if something in the
> guest requests O_DIRECT, the host must honor that and not cache the data.

OK, O_DIRECT in the guest is just one example of the guest requesting
that data be synchronously written to disk.  It bypasses the guest page
cache, but even page-cached writes need to be written out at some point.
Any time the disk driver issues an I/O where it expects the data to be
on disk (possibly excepting low-level storage subsystem caching) is the
area of concern.
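
To make that expectation concrete, here is a minimal sketch of a
guest-side O_DIRECT write (assuming Linux; the path, block size and
error handling are made up for illustration):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t blk = 4096;   /* O_DIRECT needs aligned buffer/offset/size */
    void *buf;
    int fd = open("/mnt/data/testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (posix_memalign(&buf, blk, blk)) { perror("posix_memalign"); return 1; }
    memset(buf, 0xab, blk);

    /* Bypasses this kernel's page cache: when pwrite() returns, the data
     * has been handed to the storage stack underneath -- which, inside a
     * guest, is exactly the layer under discussion here. */
    if (pwrite(fd, buf, blk, 0) != (ssize_t)blk) { perror("pwrite"); return 1; }

    free(buf);
    close(fd);
    return 0;
}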

* Mark Wagner (mwagner@redhat.com) wrote:
> Anthony Liguori wrote:
>> It's extremely important to understand what the guarantee is.  The  
>> guarantee is that upon completion on write(), the data will have been  
>> reported as written by the underlying storage subsystem.  This does  
>> *not* mean that the data is on disk.
>
> I apologize if I worded it poorly; I assume that the guarantee is that
> the data has been sent to the storage controller and said controller
> sent an indication that the write has completed.  This could mean
> multiple things, like it's in the controller's cache, on the disk, etc.
>
> I do not believe that this means that the data is still sitting in the
> host cache.  I realize it may not yet be on a disk, but, at a minimum,
> I would expect that it has been sent to the storage controller.  Do you
> consider the host's cache to be part of the storage subsystem?

Either wt or uncached (so host O_DSYNC or O_DIRECT) would suffice to get
it through to the host's storage subsystem, and I think that's been the
core of the discussion (plus defaults, etc.).
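
In code terms the difference is just a pair of open(2) flags on the host
side; a rough sketch (the cache-policy enum and helper are illustrative,
not qemu's actual code):

#define _GNU_SOURCE
#include <fcntl.h>

enum cache_policy { CACHE_WRITEBACK, CACHE_WRITETHROUGH, CACHE_NONE };

static int image_open_flags(enum cache_policy p)
{
    int flags = O_RDWR;

    switch (p) {
    case CACHE_WRITETHROUGH:
        flags |= O_DSYNC;    /* host page cache is used, but a completed
                                write has reached the storage subsystem */
        break;
    case CACHE_NONE:
        flags |= O_DIRECT;   /* bypass the host page cache entirely */
        break;
    case CACHE_WRITEBACK:
    default:
        break;               /* plain cached writes; weakest guarantee */
    }
    return flags;
}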

>> In the case of KVM, even using write-back caching with the host page  
>> cache, we are still honoring the guarantee of O_DIRECT.  We just have  
>> another level of caching that happens to be write-back.
>
> I still don't get it.  If I have something running on the host that I
> open with O_DIRECT, do you still consider it not to be a violation of
> the system call if that data ends up in the host cache instead of being
> sent to the storage controller?

I suppose an argument could be made for host caching and write-back to
be considered part of the storage subsystem from the guest's point of
view, but then we also need to bring in the requirement for proper cache
flushing.  Given that a popular Linux guest filesystem can be a little
fast and loose, wb plus flushing isn't really the optimal choice for the
integrity case.
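
For what "proper cache flushing" would mean in practice, a tiny sketch
(hypothetical helper, not qemu code): a guest-issued flush command has
to be turned into a real flush of the host file before it is completed.

#include <unistd.h>

/* Hypothetical helper: complete a guest FLUSH CACHE command only after
 * the host has pushed its own dirty pages for the image out to storage. */
static int handle_guest_flush(int image_fd)
{
    /* fdatasync() writes out the host page cache for this file and, on
     * filesystems that support it, asks the device to flush its write
     * cache; without this, wb caching gives no integrity at all. */
    return fdatasync(image_fd);
}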

thanks,
-chris


* Re: [Qemu-devel] [RFC] Disk integrity in QEMU
       [not found]         ` <48F1CF9E.9030500@redhat.com>
@ 2008-10-12 14:37           ` Dor Laor
  2008-10-12 15:35             ` Jamie Lokier
  2008-10-12 18:02             ` Anthony Liguori
       [not found]           ` <48F23AF1.2000104@codemonkey.ws>
  1 sibling, 2 replies; 9+ messages in thread
From: Dor Laor @ 2008-10-12 14:37 UTC (permalink / raw)
  To: qemu-devel
  Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier,
	Ryan Harper

Avi Kivity wrote:
> Chris Wright wrote:
>> I think it's safe to say the perf folks are concerned w/ data integrity
>> first, stable/reproducible results second, and raw performance third.
>>
>> So seeing data cached in host was simply not what they expected.  I think
>> write through is sufficient.  However I think that uncached vs. wt will
>> show up on the radar under reproducible results (need to tune based on
>> cache size).  And in most overcommit scenarios memory is typically more
>> precious than cpu, it's unclear to me if the extra buffering is anything
>> other than memory overhead.  As long as it's configurable then it's
>> comparable and benchmarking and best practices can dictate best choice.
>>   
>
> Getting good performance because we have a huge amount of free memory 
> in the host is not a good benchmark.  Under most circumstances, the 
> free memory will be used either for more guests, or will be given to 
> the existing guests, which can utilize it more efficiently than the host.
>
> I can see two cases where this is not true:
>
> - using older, 32-bit guests which cannot utilize all of the cache.  I 
> think Windows XP is limited to 512MB of cache, and usually doesn't 
> utilize even that.  So if you have an application running on 32-bit 
> Windows (or on 32-bit Linux with pae disabled), and a huge host, you 
> will see a significant boost from cache=writethrough.  This is a case 
> where performance can exceed native, simply because native cannot 
> exploit all the resources of the host.
>
> - if cache requirements vary in time across the different guests, and 
> if some smart ballooning is not in place, having free memory on the 
> host means we utilize it for whichever guest has the greatest need, so 
> overall performance improves.
>
>
>
Another justification for O_DIRECT is that many production systems will
use base images for their VMs.
It's mainly true for desktop virtualization, but probably also for some
server virtualization deployments.
In these types of scenarios, we can have the whole base image chain
opened with caching by default, read-only, while the
leaf images are opened with cache=off.
Since there is an ongoing effort (both by IT and by developers) to keep the
base images as big as possible, this guarantees that
this data is best suited for caching in the host while the private leaf
images will be uncached.
This way we provide good performance and caching for the shared parent
images while also promising correctness.
Actually this is what happens on mainline qemu with cache=off.
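
As a sketch of that policy (the helpers are hypothetical; the real
backing-file handling lives inside qemu's block layer): the shared base
images get cached, read-only opens, while the private leaf gets O_DIRECT.

#define _GNU_SOURCE
#include <fcntl.h>

static int open_leaf(const char *path)
{
    return open(path, O_RDWR | O_DIRECT);   /* uncached, private leaf data */
}

static int open_backing(const char *path)
{
    return open(path, O_RDONLY);            /* cached, shared, read-only */
}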

Cheers,
Dor


* Re: [Qemu-devel] [RFC] Disk integrity in QEMU
  2008-10-12 14:37           ` Dor Laor
@ 2008-10-12 15:35             ` Jamie Lokier
  2008-10-12 18:00               ` Anthony Liguori
  2008-10-12 18:02             ` Anthony Liguori
  1 sibling, 1 reply; 9+ messages in thread
From: Jamie Lokier @ 2008-10-12 15:35 UTC (permalink / raw)
  To: qemu-devel
  Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel,
	Laurent Vivier

Dor Laor wrote:
> Actually this is what happens on mainline qemu with cache=off.

Have I understood right that cache=off on a qcow2 image only uses
O_DIRECT for the leaf image, and the chain of base images doesn't use
O_DIRECT?

Sometimes on a memory constrained host, where the (collective) guest
memory is nearly as big as the host memory, I'm not sure this is what
I want.

-- Jamie


* Re: [Qemu-devel] [RFC] Disk integrity in QEMU
  2008-10-12  1:50         ` [Qemu-devel] [RFC] Disk integrity in QEMU Chris Wright
@ 2008-10-12 16:22           ` Jamie Lokier
  0 siblings, 0 replies; 9+ messages in thread
From: Jamie Lokier @ 2008-10-12 16:22 UTC (permalink / raw)
  To: qemu-devel
  Cc: Mark Wagner, Chris Wright, Mark McLoughlin, kvm, Laurent Vivier,
	Ryan Harper

Chris Wright wrote:
> Either wt or uncached (so host O_DSYNC or O_DIRECT) would suffice to get
> it through to the host's storage subsystem, and I think that's been the
> core of the discussion (plus defaults, etc.).

Just want to point out that the storage commitment from O_DIRECT can
be _weaker_ than O_DSYNC.

On Linux, O_DIRECT never uses storage-device barriers or
transactions, but O_DSYNC sometimes does, and fsync is even more
likely to than O_DSYNC.

I'm not certain, but I think the same applies to other host OSes too -
including Windows, which has its own equivalents to O_DSYNC and
O_DIRECT, and extra documented semantics when they are used together.

Although this is a host implementation detail, unfortunately it means
that O_DIRECT=no-cache and O_DSYNC=write-through-cache is not an
accurate characterisation.

Some might be misled into assuming that "cache=off" commits their data
to hard storage as strongly as "cache=wb" would.

I think you can assume this only when the underlying storage devices'
write caches are disabled.  You cannot assume this if the host
filesystem uses barriers instead of disabling the storage devices'
write cache.
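
The defensive pattern, for anyone who needs the stronger guarantee, is
to combine the two; a sketch (assuming Linux, and whether the flush
really reaches the drive cache still depends on filesystem and mount
options):

#define _GNU_SOURCE
#include <unistd.h>

/* Illustration only: fd is assumed to be open with O_DIRECT. */
static int durable_pwrite(int fd, const void *buf, size_t len, off_t off)
{
    ssize_t n = pwrite(fd, buf, len, off);
    if (n != (ssize_t)len)
        return -1;
    /* O_DIRECT alone may leave the data in the drive's volatile write
     * cache; on filesystems that implement barriers, fdatasync() is what
     * asks the device to flush it. */
    return fdatasync(fd);
}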

Unfortunately there's not a lot qemu can do about these various quirks,
but at least they should be documented, so that someone requiring
storage commitment (e.g. for a critical guest database) is advised to
investigate whether O_DIRECT and/or O_DSYNC give them what they
require with their combination of host kernel, filesystem, filesystem
options and storage device(s).

-- Jamie


* Re: [Qemu-devel] [RFC] Disk integrity in QEMU
  2008-10-12 15:35             ` Jamie Lokier
@ 2008-10-12 18:00               ` Anthony Liguori
  0 siblings, 0 replies; 9+ messages in thread
From: Anthony Liguori @ 2008-10-12 18:00 UTC (permalink / raw)
  To: qemu-devel
  Cc: Chris Wright, Mark McLoughlin, Ryan Harper, kvm-devel,
	Laurent Vivier

Jamie Lokier wrote:
> Dor Laor wrote:
>   
>> Actually this is what happens on mainline qemu with cache=off.
>>     
>
> Have I understood right that cache=off on a qcow2 image only uses
> O_DIRECT for the leaf image, and the chain of base images doesn't use
> O_DIRECT?
>   

Yeah, that's a bug IMHO, and my patch to add O_DSYNC fixes that.  I
think an argument for O_DIRECT in the leaf and wb in the base is
seriously flawed...

Regards,

Anthony Liguori

> Sometimes on a memory constrained host, where the (collective) guest
> memory is nearly as big as the host memory, I'm not sure this is what
> I want.
>
> -- Jamie
>
>
>   



* Re: [Qemu-devel] [RFC] Disk integrity in QEMU
  2008-10-12 14:37           ` Dor Laor
  2008-10-12 15:35             ` Jamie Lokier
@ 2008-10-12 18:02             ` Anthony Liguori
  2008-10-15 10:17               ` Andrea Arcangeli
  1 sibling, 1 reply; 9+ messages in thread
From: Anthony Liguori @ 2008-10-12 18:02 UTC (permalink / raw)
  To: Dor Laor
  Cc: qemu-devel, Chris Wright, Mark McLoughlin, kvm-devel,
	Laurent Vivier, Ryan Harper

Dor Laor wrote:
> Avi Kivity wrote:
>
> Since there is an ongoing effort (both by IT and by developers) to keep the
> base images as big as possible, this guarantees that
> this data is best suited for caching in the host while the private
> leaf images will be uncached.

A proper CAS (content-addressable storage) solution is really a much
better approach.  qcow2 deduplication is an interesting concept, but
such a hack :-)

> This way we provide good performance and caching for the shared parent 
> images while also promising correctness.

You get correctness by using O_DSYNC.  cache=off should disable the use 
of the page cache everywhere.

Regards,

Anthony Liguori

> Actually this is what happens on mainline qemu with cache=off.
>
> Cheers,
> Dor



* Re: [Qemu-devel] [RFC] Disk integrity in QEMU
       [not found]             ` <48F24320.9010201@redhat.com>
@ 2008-10-12 19:33               ` Izik Eidus
  2008-10-14 17:08                 ` Avi Kivity
  0 siblings, 1 reply; 9+ messages in thread
From: Izik Eidus @ 2008-10-12 19:33 UTC (permalink / raw)
  To: qemu-devel
  Cc: Anthony Liguori, Chris Wright, Mark McLoughlin, kvm-devel,
	Laurent Vivier, Ryan Harper

Avi Kivity wrote:
>
> LRU typically makes fairly bad decisions since it throws most of the
> information it has away.  I recommend looking up LRU-K and similar
> algorithms, just to get a feel for this; it is basically the simplest
> possible algorithm short of random selection.
>
> Note that Linux doesn't even have an LRU; it has to approximate since it
> can't sample all of the pages all of the time.  With a hypervisor that
> uses Intel's EPT, it's even worse since we don't have an accessed bit.
> On silly benchmarks that just exercise the disk and touch no memory, and
> if you tune the host very aggressively, LRU will win on long running
> guests since it will eventually page out all unused guest memory (with
> Linux guests, it will never even page guest memory in).  On real life
> applications I don't think there is much chance.
>
>   
But when using O_DIRECT you actually make the pages not swappable at all...
or am I wrong?
Maybe some kind of combination with the mm shrinker could be good;
do_try_to_free_pages is a good point of reference.


* Re: [Qemu-devel] [RFC] Disk integrity in QEMU
  2008-10-12 19:33               ` Izik Eidus
@ 2008-10-14 17:08                 ` Avi Kivity
  0 siblings, 0 replies; 9+ messages in thread
From: Avi Kivity @ 2008-10-14 17:08 UTC (permalink / raw)
  To: qemu-devel
  Cc: Chris Wright, Mark McLoughlin, kvm-devel, Laurent Vivier,
	Ryan Harper

Izik Eidus wrote: 
> But when using O_DIRECT you actually make the pages not swappable at
> all...
> or am I wrong?

Only for the duration of the I/O operation, which is typically very short.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.



* Re: [Qemu-devel] [RFC] Disk integrity in QEMU
  2008-10-12 18:02             ` Anthony Liguori
@ 2008-10-15 10:17               ` Andrea Arcangeli
  0 siblings, 0 replies; 9+ messages in thread
From: Andrea Arcangeli @ 2008-10-15 10:17 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Dor Laor, qemu-devel, Chris Wright, Mark McLoughlin, kvm-devel,
	Laurent Vivier, Ryan Harper

[-- Attachment #1: Type: text/plain, Size: 2164 bytes --]

On Sun, Oct 12, 2008 at 01:02:57PM -0500, Anthony Liguori wrote:
> You get correctness by using O_DSYNC.  cache=off should disable the use of 
> the page cache everywhere.

The parent shared image is generally read-only (assuming no cluster fs
or shared database storage), so O_DSYNC on the parent will be a no-op,
but it's ok if you like it as a default.

By default, having caching enabled on the parent makes sense to me
(O_DSYNC doesn't disable the cache like O_DIRECT does; reads are
cached).  Because the qemu command line is agnostic of qcow2 internals
(you can't specify which parent/child image to use; that's left to
qemu-img to set in the qcow2 metadata), I guess the O_DIRECT/O_DSYNC
behavior on the parent image should also be left to qemu-img, assuming
there's any reserved bitflag left in the qcow2 metadata that could be
used to specify those bits.
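
Purely as a sketch of that idea (qcow2 defines no such flag today; the
bit and helper below are made up): qemu-img would set a per-image hint
bit and qemu would consult it when opening the backing file.

#define _GNU_SOURCE
#include <fcntl.h>

#define HYP_QCOW2_CACHE_BACKING  (1u << 0)   /* hypothetical reserved bit */

static int backing_open_flags(unsigned int hyp_header_flags, int base_flags)
{
    if (hyp_header_flags & HYP_QCOW2_CACHE_BACKING)
        return base_flags;                   /* cached (or O_DSYNC) backing */
    return base_flags | O_DIRECT;            /* uncached backing file */
}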

I also attached the results of my O_DIRECT measurements.  O_DIRECT
seems very close to optimal already after the fixes to qcow2 that avoid
submitting aio reads/writes only as large as a qcow2 cluster.  I was
initially fooled because I hadn't reduced the RAM on the host to the
guest size plus less than the minimum file size of iozone; after doing
that, O_DIRECT wins.  All tests were run with the emulated IDE driver,
which is the one that SolidICE is using right now with non-Linux
guests.  The aio-thread patch can't make any difference with IDE, as
verified here.

I also tried to enlarge the maximum DMA size in the IDE driver to 512k
(it's limited to 128k), but I couldn't measure any benefit.  128k DMA on
the host seems enough to reach platter speed.

I also tried with DMA disabled in the guest IDE driver, and that
destroys the O_DIRECT performance, because then the commands are too
small to reach platter speed.  The host IDE driver needs something
>=64k to reach platter speed.
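
A toy way to see the request-size effect (not the iozone setup behind
the attached spreadsheet; the path and sizes are arbitrary, and the file
must live on a filesystem that supports O_DIRECT):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const size_t sizes[] = { 4096, 65536, 131072 };
    const size_t total = 64 * 1024 * 1024;          /* 64 MB per run */
    int fd = open("/var/tmp/odirect-test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        void *buf;
        struct timespec t0, t1;
        if (posix_memalign(&buf, 4096, sizes[i])) return 1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t off = 0; off < total; off += sizes[i])
            if (pwrite(fd, buf, sizes[i], off) != (ssize_t)sizes[i]) {
                perror("pwrite"); return 1;
            }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%zu-byte requests: %.1f MB/s\n",
               sizes[i], total / secs / (1024 * 1024));
        free(buf);
    }
    close(fd);
    return 0;
}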

In short, I think that except for boot time O_DIRECT is a must, and
things like this are why MAP_SHARED isn't nearly as good as O_DIRECT for
certain cases, as it won't waste any cpu on the VM pagetable manglings
and msyncing.  So the parent image is the only one where it makes sense
to allow caching, to speed up boot time and application startup from the
shared executables.

[-- Attachment #2: iozone-cleo-trunk-dma.ods --]
[-- Type: application/vnd.oasis.opendocument.spreadsheet, Size: 37205 bytes --]


end of thread

Thread overview: 9+ messages
     [not found] <48EE38B9.2050106@codemonkey.ws>
     [not found] ` <48EF1D55.7060307@redhat.com>
     [not found]   ` <48F0E83E.2000907@redhat.com>
     [not found]     ` <48F10DFD.40505@codemonkey.ws>
     [not found]       ` <48F14814.7000805@redhat.com>
2008-10-12  1:50         ` [Qemu-devel] [RFC] Disk integrity in QEMU Chris Wright
2008-10-12 16:22           ` Jamie Lokier
     [not found]       ` <20081012004401.GA9763@acer.localdomain>
     [not found]         ` <48F1CF9E.9030500@redhat.com>
2008-10-12 14:37           ` Dor Laor
2008-10-12 15:35             ` Jamie Lokier
2008-10-12 18:00               ` Anthony Liguori
2008-10-12 18:02             ` Anthony Liguori
2008-10-15 10:17               ` Andrea Arcangeli
     [not found]           ` <48F23AF1.2000104@codemonkey.ws>
     [not found]             ` <48F24320.9010201@redhat.com>
2008-10-12 19:33               ` Izik Eidus
2008-10-14 17:08                 ` Avi Kivity
