qemu-devel.nongnu.org archive mirror
* [Qemu-devel] Storage requirements for live migration
@ 2011-11-11  0:11 Anthony Liguori
  2011-11-11  6:27 ` Mark Wu
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Anthony Liguori @ 2011-11-11  0:11 UTC (permalink / raw)
  To: qemu-devel
  Cc: Kevin Wolf, Juan Quintela, Christoph Hellwig, Avi Kivity,
	Stefan Hajnoczi

I did a brain dump of my understanding of the various storage requirements for 
live migration.  I think it's accurate but I may have misunderstood some details, 
so I would appreciate review.

I think given sections (1) and (2), the only viable thing is to require 
cache=none unless we get new interfaces to flush caches.

Section (3) talks about image formats.  As I mentioned elsewhere in the thread, 
I think the best we can do right now is have a block layer interface to quiesce 
the image format.  I think reopen may be a viable short term strategy for qcow2 
but I think for raw, we should just make the quiesce operation a nop.

http://wiki.qemu.org/Migration/Storage

Inlined below for ease of review.

Regards,

Anthony Liguori

Migration in QEMU is designed assuming cache-coherent shared storage and raw 
format block devices.  There are some cases where live migration will also work 
with more weakly coherent shared storage.  This wiki page attempts to outline 
those scenarios.  It also attempts to iterate through the reasons why various 
image formats do not support migration even with shared storage.

== NFS ==

=== Background ===

NFS only offers close-to-open cache coherence.  This means that the only 
guarantee provided by the protocol is that if you close a file on client A and 
then open the same file on another client B, client B will see client A's changes.
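The close-to-open contract can be made concrete with a toy model.  All class
and method names below are illustrative, not any real NFS client API:

```python
# Toy model of NFS close-to-open coherence: each client keeps a private
# cache of file contents, and the protocol only promises that the cache is
# revalidated against the server on open() and flushed to it on close().

class Server:
    def __init__(self):
        self.data = {}

class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}     # may be stale between an open/close pair

    def open(self, path):
        # close-to-open: revalidate against the server on open
        self.cache[path] = self.server.data.get(path, b"")

    def write(self, path, data):
        self.cache[path] = data         # buffered client-side

    def read(self, path):
        return self.cache.get(path, b"")  # served from cache: may be stale

    def close(self, path):
        # close-to-open: flush to the server on close
        self.server.data[path] = self.cache[path]

srv = Server()
a, b = Client(srv), Client(srv)
a.open("disk.img"); b.open("disk.img")
a.write("disk.img", b"new")
# B still sees stale data while both clients hold the file open ...
assert b.read("disk.img") == b""
# ... and only after A closes and B reopens is the update guaranteed.
a.close("disk.img")
b.open("disk.img")
assert b.read("disk.img") == b"new"
```

This is exactly the window migration falls into: both hosts hold the image
open at the same time, so the protocol guarantees nothing about what the
destination reads.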

The way migration works in QEMU, the source stops the guest after it sends all 
of the required data but does not immediately free any resources.  This makes 
migration more reliable since it avoids the Two Generals Problem by allowing a 
reliable third node to make the final decision about whether migration was 
successful.

As soon as the destination receives all of the data, it immediately starts the 
guest.  This means that the reliable third node is not in the critical path of 
migration downtime but can still recover a failed migration.

Since the source never knows that the destination is okay, the only way to 
support NFS robustly would be to close all files on the source before sending 
the last chunk of migration data.  This would mean that if any failure occurred 
after this point, the VM would be lost.
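The hand-off described above can be sketched as a small state machine.  The
names are hypothetical; real QEMU and management-tool interfaces differ:

```python
# Sketch of the migration hand-off: the source stops the guest but keeps
# all of its resources until a third party (the management tool acting as
# arbiter) decides the outcome, so a failure at any point still leaves
# exactly one usable copy of the VM.

SOURCE_HOLDING, SOURCE_RELEASED = "holding", "released"
DEST_WAITING, DEST_RUNNING = "waiting", "running"

def migrate(transfer_ok, arbiter_says_success):
    source, dest = SOURCE_HOLDING, DEST_WAITING
    if transfer_ok:
        dest = DEST_RUNNING          # destination starts the guest immediately
    # The arbiter is off the critical path: the guest is already running on
    # the destination before this decision is made, so it adds no downtime,
    # but it can still recover a failed migration.
    if transfer_ok and arbiter_says_success:
        source = SOURCE_RELEASED     # only now is it safe to free the source
    return source, dest

# Transfer failed: the source still holds the guest, nothing is lost.
assert migrate(False, False) == (SOURCE_HOLDING, DEST_WAITING)
# Transfer succeeded and the arbiter confirmed: the source releases.
assert migrate(True, True) == (SOURCE_RELEASED, DEST_RUNNING)
```

Closing all files on the source before the final chunk would collapse the
two outcomes: any failure after the close would leave no host holding a
valid, coherent copy of the VM.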

=== In Practice ===

A Linux NFS server that exports with 'sync' offers stronger coherence than NFS 
guarantees.  This is an implementation detail, not a guarantee, as far as I know. 
If the client sends a read request, then any data that has been acknowledged as 
written with a stable write by any other client will be returned without the need 
to close and reopen the file.

A file opened with O_DIRECT by the Linux NFS client code will always issue a 
protocol read operation for a userspace read() call.  This means that if you 
issue stable writes (fsync) on the source and then use O_DIRECT to read on the 
destination, you can safely access the same file without reopening it.

=== Conclusion ===

Migration with QEMU is safe, in practice, when using Linux as an NFS server and 
client when both the source and destination are using cache=none for the disks 
and a raw file.

== iSCSI/Direct Attached Storage ==

iSCSI has a similar cache coherency guarantee to direct attached storage (via 
fibre channel).  Any read request will return data that has been acknowledged as 
written by another client.

Since QEMU issues read() requests from userspace, Linux normally satisfies them 
from the page cache.  The Linux page cache is not coherent across multiple 
nodes, so the only way to access shared storage coherently is to bypass the 
Linux page cache via cache=none.

=== Conclusion ===

iSCSI, FC, or other forms of direct attached storage are only safe to use with 
live migration if you use cache=none and a raw image.
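As a reader's summary of how the cache= options discussed in this thread map
onto host open(2) flags and onto migration safety on coherently shared
storage (a rough mapping, not a definitive statement of QEMU internals):

```python
# Rough mapping from QEMU's cache= option to the host open(2) flags it
# implies and to whether the host page cache is involved.  The column that
# matters for migration is the last one: any mode that reads through the
# host page cache can return stale data on the destination, regardless of
# how writes are synced.

CACHE_MODES = {
    #  mode           O_DIRECT  O_DSYNC  host page cache
    "none":          (True,     False,   False),
    "directsync":    (True,     True,    False),
    "writethrough":  (False,    True,    True),
    "writeback":     (False,    False,   True),
    "unsafe":        (False,    False,   True),
}

def migration_safe_on_shared_storage(mode):
    # Safe only if reads bypass the host page cache entirely.
    direct, _dsync, page_cache = CACHE_MODES[mode]
    return direct and not page_cache

assert migration_safe_on_shared_storage("none")
assert migration_safe_on_shared_storage("directsync")
assert not migration_safe_on_shared_storage("writethrough")
```

Note that writethrough fails the test even though its writes are fully
synchronous: O_DSYNC only constrains the write path, while the coherence
problem here is on the read path.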

== Clustered File Systems ==

Clustered file systems such as GPFS, Ceph, GlusterFS, or GFS2 are safe to use 
with live migration regardless of the caching option used, as long as raw images 
are used.

== Image Formats ==

Image formats are not safe to use with live migration.  The reason is that QEMU 
caches data for image formats and does not have a mechanism to flush those 
caches.  The following attempts to describe the issues with the various formats.

=== QCOW2 ===

QCOW2 caches two forms of data, cluster metadata (L1/L2 data, refcount table, 
etc) and mutable header information (file size, snapshot entries, etc).

This data needs to be discarded before after migration starts.
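Why this cached metadata is a problem can be sketched with a toy structure
(hypothetical names, not qcow2's actual data structures or QEMU's API):

```python
# The destination opens the image while the source is still writing to it,
# so anything the destination cached at open time can be stale by the time
# the guest starts there.  A quiesce/reopen step must drop these caches.

class CachedImageState:
    """Stands in for an image format driver's in-memory state."""
    def __init__(self, image):
        self.image = image                   # dict standing in for the file
        self.l2_cache = {}                   # cached cluster metadata
        self.header = dict(image["header"])  # header cached at open time

    def invalidate(self):
        # What quiesce/reopen must do before the guest starts: throw away
        # everything cached and re-read it from the file.
        self.l2_cache.clear()
        self.header = dict(self.image["header"])

image = {"header": {"size": 1024}}
dest = CachedImageState(image)        # destination opens the image early
image["header"]["size"] = 2048        # source keeps modifying the image
assert dest.header["size"] == 1024    # destination's cached header is stale
dest.invalidate()
assert dest.header["size"] == 2048    # fresh after invalidation
```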

=== QED ===

QED caches similar data to QCOW2.  In addition, the QED header has a dirty flag 
that must be handled specially in the case of live migration.

=== Raw Files ===

Technically, the file size of a raw file is mutable metadata that QEMU caches. 
This is only applicable when using online image resizing.  If you avoid online 
image resizing during live migration, raw files are completely safe provided the 
storage used meets the above requirements.
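A minimal demonstration of the cached-file-size hazard, using plain Python
file operations on a temporary file rather than QEMU itself:

```python
# If the file size is cached at open time and the image is then resized by
# another process (here standing in for the source host doing an online
# resize), the cached value goes stale until it is queried again.
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"\0" * 512)

cached_size = os.stat(path).st_size   # "cached" at open time
with open(path, "ab") as f:           # online resize happening elsewhere
    f.write(b"\0" * 512)

assert cached_size == 512             # the cached value is now stale
assert os.stat(path).st_size == 1024  # re-querying gives the truth
os.unlink(path)
```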


* Re: [Qemu-devel] Storage requirements for live migration
  2011-11-11  0:11 [Qemu-devel] Storage requirements for live migration Anthony Liguori
@ 2011-11-11  6:27 ` Mark Wu
  2011-11-11  9:15   ` Kevin Wolf
  2011-11-11  9:38 ` Kevin Wolf
  2011-11-11 22:43 ` Ryan Harper
  2 siblings, 1 reply; 10+ messages in thread
From: Mark Wu @ 2011-11-11  6:27 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Stefan Hajnoczi, Juan Quintela, qemu-devel,
	Avi Kivity, Christoph Hellwig


On 11/11/2011 08:11 AM, Anthony Liguori wrote:
> I did a brain dump of my understanding of the various storage 
> requirements for live migration.  I think it's accurate but I may have 
> misunderstood some details, so I would appreciate review.
>
> I think given sections (1) and (2), the only viable thing is to 
> require cache=none unless we get new interfaces to flush caches.
I have two questions:
1. Why is "cache=writethrough" not reliable?  "O_DSYNC" guarantees that 
data is written to disk before the operation completes.  Even if 
synchronization from memory to disk is still ongoing when migration 
happens, we needn't care about the dirty pages in host memory, because 
the queued requests will be re-executed on the destination host.  So I 
think it's reliable too.  Am I missing anything?

2. Can we flush the cache for "cache=writeback" in the last phase of 
migration?  I think that would resolve the cache coherence problem, but 
it could cause unpredictable guest downtime.

Thanks.
Mark



* Re: [Qemu-devel] Storage requirements for live migration
  2011-11-11  6:27 ` Mark Wu
@ 2011-11-11  9:15   ` Kevin Wolf
  0 siblings, 0 replies; 10+ messages in thread
From: Kevin Wolf @ 2011-11-11  9:15 UTC (permalink / raw)
  To: Mark Wu
  Cc: Stefan Hajnoczi, Juan Quintela, qemu-devel, Avi Kivity,
	Christoph Hellwig

On 11.11.2011 07:27, Mark Wu wrote:
> On 11/11/2011 08:11 AM, Anthony Liguori wrote:
>> I did a brain dump of my understanding of the various storage
>> requirements for live migration.  I think it's accurate but I may have
>> misunderstood some details, so I would appreciate review.
>>
>> I think given sections (1) and (2), the only viable thing is to
>> require cache=none unless we get new interfaces to flush caches.
> I have two questions:
> 1. Why is "cache=writethrough" not reliable?  "O_DSYNC" guarantees that
> data is written to disk before the operation completes.  Even if
> synchronization from memory to disk is still ongoing when migration
> happens, we needn't care about the dirty pages in host memory, because
> the queued requests will be re-executed on the destination host.  So I
> think it's reliable too.  Am I missing anything?

The problem is about reading.  The destination host reads some pieces of
information during its initialisation, which may later be changed by the
source.  After the migration has completed, the destination must not use
the old data that it read previously and that might still be in its cache;
it must go back to the storage.  O_DSYNC doesn't help with that, because
reads are still served from the cache.

Kevin


* Re: [Qemu-devel] Storage requirements for live migration
  2011-11-11  0:11 [Qemu-devel] Storage requirements for live migration Anthony Liguori
  2011-11-11  6:27 ` Mark Wu
@ 2011-11-11  9:38 ` Kevin Wolf
  2011-11-11  9:55   ` Daniel P. Berrange
  2011-11-11 14:05   ` Anthony Liguori
  2011-11-11 22:43 ` Ryan Harper
  2 siblings, 2 replies; 10+ messages in thread
From: Kevin Wolf @ 2011-11-11  9:38 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Juan Quintela, Christoph Hellwig, qemu-devel, Stefan Hajnoczi,
	Avi Kivity

On 11.11.2011 01:11, Anthony Liguori wrote:
> I did a brain dump of my understanding of the various storage requirements for 
> live migration.  I think it's accurate but I may have misunderstood some details, 
> so I would appreciate review.
> 
> I think given sections (1) and (2), the only viable thing is to require 
> cache=none unless we get new interfaces to flush caches.

Yes, I think we should strongly recommend cache=none/directsync, but not
enforce it. As you said, for clustered filesystems other options should
work, so we should allow users to choose to make use of that.

> Section (3) talks about image formats.  As I mentioned elsewhere in the thread, 
> I think the best we can do right now is have a block layer interface to quiesce 
> the image format.  I think reopen may be a viable short term strategy for qcow2 
> but I think for raw, we should just make the quiesce operation a nop.

I don't agree with making it a nop on raw. You say it's safe without any
further action if you avoid online resize, but we don't disallow online
resize during migration, so this is an invalid assumption. And did you
review all the code to make sure that we don't cache more?

I believe it's better anyway to have all formats behave the same.

Kevin


* Re: [Qemu-devel] Storage requirements for live migration
  2011-11-11  9:38 ` Kevin Wolf
@ 2011-11-11  9:55   ` Daniel P. Berrange
  2011-11-11 10:01     ` Kevin Wolf
  2011-11-11 14:08     ` Anthony Liguori
  2011-11-11 14:05   ` Anthony Liguori
  1 sibling, 2 replies; 10+ messages in thread
From: Daniel P. Berrange @ 2011-11-11  9:55 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Stefan Hajnoczi, Juan Quintela, qemu-devel, Avi Kivity,
	Christoph Hellwig

On Fri, Nov 11, 2011 at 10:38:20AM +0100, Kevin Wolf wrote:
> On 11.11.2011 01:11, Anthony Liguori wrote:
> > I did a brain dump of my understanding of the various storage requirements for 
> > live migration.  I think it's accurate but I may have misunderstood some details, 
> > so I would appreciate review.
> > 
> > I think given sections (1) and (2), the only viable thing is to require 
> > cache=none unless we get new interfaces to flush caches.
> 
> Yes, I think we should strongly recommend cache=none/directsync, but not
> enforce it. As you said, for clustered filesystems other options should
> work, so we should allow users to choose to make use of that.

WRT libvirt, we have a concept of 'tainting' for guests. We set taint
flags whenever the management application requests a config, or performs
an action that we know to be potentially dangerous. These end up as log
messages in the per-guest logfile, so when users report bugs we can see
from the log that something "bad" has been done to the guest.

At the very least, it sounds like we should make libvirt mark guests as
tainted, if they have been migrated with cache != none, so this is easily
identifiable by BZ support people.

We might also want to add a libvirt host-level config option to allow
host admins to forbid migration without cache=none.

Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|


* Re: [Qemu-devel] Storage requirements for live migration
  2011-11-11  9:55   ` Daniel P. Berrange
@ 2011-11-11 10:01     ` Kevin Wolf
  2011-11-11 14:08     ` Anthony Liguori
  1 sibling, 0 replies; 10+ messages in thread
From: Kevin Wolf @ 2011-11-11 10:01 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Stefan Hajnoczi, Juan Quintela, qemu-devel, Avi Kivity,
	Christoph Hellwig

On 11.11.2011 10:55, Daniel P. Berrange wrote:
> On Fri, Nov 11, 2011 at 10:38:20AM +0100, Kevin Wolf wrote:
>> On 11.11.2011 01:11, Anthony Liguori wrote:
>>> I did a brain dump of my understanding of the various storage requirements for 
>>> live migration.  I think it's accurate but I may have misunderstood some details, 
>>> so I would appreciate review.
>>>
>>> I think given sections (1) and (2), the only viable thing is to require 
>>> cache=none unless we get new interfaces to flush caches.
>>
>> Yes, I think we should strongly recommend cache=none/directsync, but not
>> enforce it. As you said, for clustered filesystems other options should
>> work, so we should allow users to choose to make use of that.
> 
> WRT libvirt, we have a concept of 'tainting' for guests. We set taint
> flags whenever the management application requests a config, or performs
> an action that we know to be potentially dangerous. These end up as log
> messages in the per-guest logfile, so when users report bugs we can see
> from the log that something "bad" has been done to the guest.
> 
> At the very least, it sounds like we should make libvirt mark guests as
> tainted, if they have been migrated with cache != none, so this is easily
> identifiable by BZ support people.
> 
> We might also want to add a libvirt host-level config option to allow
> host admins to forbid migration without cache=none.

Yes, I think this would be a good approach.

Kevin


* Re: [Qemu-devel] Storage requirements for live migration
  2011-11-11  9:38 ` Kevin Wolf
  2011-11-11  9:55   ` Daniel P. Berrange
@ 2011-11-11 14:05   ` Anthony Liguori
  1 sibling, 0 replies; 10+ messages in thread
From: Anthony Liguori @ 2011-11-11 14:05 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-devel, Avi Kivity, Christoph Hellwig, Stefan Hajnoczi,
	Juan Quintela

On 11/11/2011 03:38 AM, Kevin Wolf wrote:
> On 11.11.2011 01:11, Anthony Liguori wrote:
>> I did a brain dump of my understanding of the various storage requirements for
>> live migration.  I think it's accurate but I may have misunderstood some details,
>> so I would appreciate review.
>>
>> I think given sections (1) and (2), the only viable thing is to require
>> cache=none unless we get new interfaces to flush caches.
>
> Yes, I think we should strongly recommend cache=none/directsync, but not
> enforce it. As you said, for clustered filesystems other options should
> work, so we should allow users to choose to make use of that.
>
>> Section (3) talks about image formats.  As I mentioned elsewhere in the thread,
>> I think the best we can do right now is have a block layer interface to quiesce
>> the image format.  I think reopen may be a viable short term strategy for qcow2
>> but I think for raw, we should just make the quiesce operation a nop.
>
> I don't agree with making it a nop on raw. You say it's safe without any
> further action if you avoid online resize, but we don't disallow online
> resize during migration, so this is an invalid assumption. And did you
> review all the code to make sure that we don't cache more?

It's not just that.  We cache information outside of the block layer (like the 
CHS geometry).  This means that reopening isn't even enough, because that only 
drops the cached data in the block layer.

The file size is also cached by devices.

Regards,

Anthony Liguori

>
> I believe it's better anyway to have all formats behave the same.
>
> Kevin
>


* Re: [Qemu-devel] Storage requirements for live migration
  2011-11-11  9:55   ` Daniel P. Berrange
  2011-11-11 10:01     ` Kevin Wolf
@ 2011-11-11 14:08     ` Anthony Liguori
  1 sibling, 0 replies; 10+ messages in thread
From: Anthony Liguori @ 2011-11-11 14:08 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Kevin Wolf, Stefan Hajnoczi, Juan Quintela, qemu-devel,
	Avi Kivity, Christoph Hellwig

On 11/11/2011 03:55 AM, Daniel P. Berrange wrote:
> On Fri, Nov 11, 2011 at 10:38:20AM +0100, Kevin Wolf wrote:
>> On 11.11.2011 01:11, Anthony Liguori wrote:
>>> I did a brain dump of my understanding of the various storage requirements for
>>> live migration.  I think it's accurate but I may have misunderstood some details,
>>> so I would appreciate review.
>>>
>>> I think given sections (1) and (2), the only viable thing is to require
>>> cache=none unless we get new interfaces to flush caches.
>>
>> Yes, I think we should strongly recommend cache=none/directsync, but not
>> enforce it. As you said, for clustered filesystems other options should
>> work, so we should allow users to choose to make use of that.
>
> WRT libvirt, we have a concept of 'tainting' for guests. We set taint
> flags whenever the management application requests a config, or performs
> an action that we know to be potentially dangerous. These end up as log
> messages in the per-guest logfile, so when users report bugs we can see
> from the log that something "bad" has been done to the guest.
>
> At the very least, it sounds like we should make libvirt mark guests as
> tainted, if they have been migrated with cache != none, so this is easily
> identifiable by BZ support people.
>
> We might also want to add a libvirt host-level config option to allow
> host admins to forbid migration without cache=none.

It might make more sense to make it a property of the storage pool.  That is, a 
storage pool should have a notion of whether it supports migration and what 
constraints (if any) apply to its migration support.

Regards,

Anthony Liguori

>
> Daniel


* Re: [Qemu-devel] Storage requirements for live migration
  2011-11-11  0:11 [Qemu-devel] Storage requirements for live migration Anthony Liguori
  2011-11-11  6:27 ` Mark Wu
  2011-11-11  9:38 ` Kevin Wolf
@ 2011-11-11 22:43 ` Ryan Harper
  2011-11-11 23:23   ` Anthony Liguori
  2 siblings, 1 reply; 10+ messages in thread
From: Ryan Harper @ 2011-11-11 22:43 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Stefan Hajnoczi, Juan Quintela, qemu-devel,
	Avi Kivity, Christoph Hellwig

> == Image Formats ==
> 
> Image formats are not safe to use with live migration.  The reason is that 
> QEMU caches data for image formats and does not have a mechanism to flush 
> those caches.  The following attempts to describe the issues with the 
> various formats.
> 
> === QCOW2 ===
> 
> QCOW2 caches two forms of data, cluster metadata (L1/L2 data, refcount 
> table, etc) and mutable header information (file size, snapshot entries, 
> etc).
> 
> This data needs to be discarded before after migration starts.

before after?

pick one =)


-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ryanh@us.ibm.com


* Re: [Qemu-devel] Storage requirements for live migration
  2011-11-11 22:43 ` Ryan Harper
@ 2011-11-11 23:23   ` Anthony Liguori
  0 siblings, 0 replies; 10+ messages in thread
From: Anthony Liguori @ 2011-11-11 23:23 UTC (permalink / raw)
  To: Ryan Harper
  Cc: Kevin Wolf, Stefan Hajnoczi, Juan Quintela, qemu-devel,
	Avi Kivity, Christoph Hellwig

On 11/11/2011 04:43 PM, Ryan Harper wrote:
>> == Image Formats ==
>>
>> Image formats are not safe to use with live migration.  The reason is that
>> QEMU caches data for image formats and does not have a mechanism to flush
>> those caches.  The following attempts to describe the issues with the
>> various formats.
>>
>> === QCOW2 ===
>>
>> QCOW2 caches two forms of data, cluster metadata (L1/L2 data, refcount
>> table, etc) and mutable header information (file size, snapshot entries,
>> etc).
>>
>> This data needs to be discarded before after migration starts.
>
> before after?
>
> pick one =)

Er, the sentence is intelligible with either word.  I've updated the wiki with 
something that actually makes sense :-)

Regards,

Anthony Liguori

>
>

