* [Qemu-devel] Storage requirements for live migration
From: Anthony Liguori @ 2011-11-11 0:11 UTC
To: qemu-devel
Cc: Kevin Wolf, Juan Quintela, Christoph Hellwig, Avi Kivity,
Stefan Hajnoczi
I did a brain dump of my understanding of the various storage requirements for
live migration. I think it's accurate but I may have misunderstood some details
so I would appreciate review.
I think given sections (1) and (2), the only viable thing is to require
cache=none unless we get new interfaces to flush caches.
Section (3) talks about image formats. As I mentioned elsewhere in the thread,
I think the best we can do right now is have a block layer interface to quiesce
the image format. I think reopen may be a viable short term strategy for qcow2
but I think for raw, we should just make the quiesce operation a nop.
http://wiki.qemu.org/Migration/Storage
Inlined below for ease of review.
Regards,
Anthony Liguori
Migration in QEMU is designed assuming cache coherent shared storage and raw
format block devices. There are some cases where migration will also work
with more weakly coherent shared storage. This wiki page attempts to outline
those scenarios. It also attempts to iterate through the reasons why various
image formats do not support migration even with shared storage.
== NFS ==
=== Background ===
NFS only offers close-to-open cache coherence. This means that the only
guarantee provided by the protocol is that if you close a file in a client A and
then open the file in another client B, client B will see client A's changes.
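A toy model of close-to-open coherence may make the guarantee easier to see. It is only a sketch: `Server` and `Client` are hypothetical classes, not an NFS API, and the client is modelled as touching the server only on open and close, which is the weakest behaviour the guarantee permits rather than what any particular client does.

```python
class Server:
    """Hypothetical stand-in for the NFS server: holds the authoritative data."""

    def __init__(self, data=b""):
        self.data = data


class Client:
    """Hypothetical client that only talks to the server on open() and close()."""

    def __init__(self, server):
        self.server = server
        self.cache = None
        self.dirty = False

    def open(self):
        # Opening revalidates the local cache against the server.
        self.cache = self.server.data
        self.dirty = False

    def read(self):
        # Reads may be served purely from the local cache.
        return self.cache

    def write(self, data):
        # Writes may stay in the local cache until close().
        self.cache = data
        self.dirty = True

    def close(self):
        # Closing pushes any cached writes back to the server.
        if self.dirty:
            self.server.data = self.cache
        self.cache = None
        self.dirty = False


server = Server(b"old")
a, b = Client(server), Client(server)
a.open()
b.open()
a.write(b"new")
stale = b.read()  # B opened before A closed: nothing guarantees it sees "new"
a.close()
b.close()
b.open()
fresh = b.read()  # B opened after A closed: guaranteed to see "new"
```

In migration terms, A is the source and B is the destination, which has typically opened the image before the source has closed it.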
The way migration works in QEMU, the source stops the guest after it sends all
of the required data but does not immediately free any resources. This makes
migration more reliable since it avoids the Two Generals Problem by allowing a
reliable third node to make the final decision about whether migration was
successful.
As soon as the destination receives all of the data, it immediately starts the
guest. This means that the reliable third node is not in the critical path of
migration downtime but can still recover a failed migration.
Since the source never knows that the destination is okay, the only way to
support NFS robustly would be to close all files on the source before sending
the last chunk of migration data. This would mean that if any failure occurred
after this point, the VM would be lost.
=== In Practice ===
A Linux NFS server that exports with 'sync' offers stronger coherency than NFS
guarantees. This is an implementation detail, not a guarantee as far as I know.
If the client sends a read request, then any data that another client wrote
with an acknowledged stable write will be returned without the need to close
and reopen the file.
A file opened with O_DIRECT with the Linux NFS client code will always issue a
protocol read operation given a userspace read() call. This means that if you
issue stable writes (fsync) on the source and then use O_DIRECT to read on the
destination, you can safely access the same file without reopening.
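The access pattern described above can be sketched with plain system calls. This is a minimal illustration, not QEMU code: `write_stable`, `read_direct` and the 4096-byte `BLOCK` alignment are assumptions made for the example, `os.O_DIRECT` is only available on Linux, and partial writes are not handled.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment and transfer size; real devices may differ


def write_stable(path, offset, data):
    """Source side: write, then fsync() so the data reaches stable storage."""
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, data, offset)
        os.fsync(fd)
    finally:
        os.close(fd)


def read_direct(path, offset, length=BLOCK):
    """Destination side: O_DIRECT read that does not go through the page cache."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        # An anonymous mapping gives a page-aligned buffer, which O_DIRECT needs.
        with mmap.mmap(-1, length) as buf:
            n = os.preadv(fd, [buf], offset)
            return buf[:n]
    finally:
        os.close(fd)
```

The source would call `write_stable()` for every guest write it needs the destination to see, and the destination would use `read_direct()` without ever reopening the file. Offsets and lengths passed to `read_direct()` have to respect the alignment rules of the underlying storage.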
=== Conclusion ===
Migration with QEMU is safe, in practice, when using Linux as an NFS server and
client when both the source and destination are using cache=none for the disks
and a raw file.
== iSCSI/Direct Attached Storage ==
iSCSI has a similar cache coherency guarantee to direct attached storage (via
fibre channel). Any read request will return data that has been acknowledged as
written by another client.
Since QEMU issues read() requests in userspace, Linux normally uses the page
cache. The Linux page cache is not coherent across multiple nodes so the only
way to safely access storage coherently is to bypass the Linux page cache via
cache=none.
=== Conclusion ===
iSCSI, FC, or other forms of direct attached storage are only safe to use with
live migration if you use cache=none and a raw image.
== Clustered File Systems ==
Clustered File Systems such as GPFS, Ceph, Glusterfs, or GFS2 are safe to use
with live migration regardless of the caching option used as long as raw images
are used.
== Image Formats ==
Image formats are not safe to use with live migration. The reason is that QEMU
caches data for image formats and does not have a mechanism to flush those
caches. The following attempts to describe the issues with the various formats.
=== QCOW2 ===
QCOW2 caches two forms of data, cluster metadata (L1/L2 data, refcount table,
etc) and mutable header information (file size, snapshot entries, etc).
This data needs to be discarded before after migration starts.
=== QED ===
QED caches similar data to QCOW2. In addition, the QED header has a dirty flag
that must be handled specially in the case of live migration.
=== Raw Files ===
Technically, the file size of a raw file is mutable metadata that QEMU caches.
This is only applicable when using online image resizing. If you avoid online
image resizing during live migration, raw files are completely safe provided the
storage used meets the above requirements.
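The conclusions of the sections above can be condensed into a small lookup. This is a hedged summary, not an interface that QEMU provides: the function name and the storage labels ("nfs", "iscsi", "fc", "das", "clustered") are made up for the example, and the NFS case additionally assumes Linux on both the server and the client.

```python
def migration_safe(storage, cache, image_format, online_resize=False):
    """Return True if the page above describes this combination as safe."""
    # Image formats other than raw cache metadata that QEMU cannot flush.
    if image_format != "raw":
        return False
    # Raw is only safe if the image is not resized online during migration.
    if online_resize:
        return False
    if storage == "clustered":  # GPFS, Ceph, Glusterfs, GFS2
        return True  # any caching option
    if storage in ("nfs", "iscsi", "fc", "das"):
        return cache == "none"  # the host page cache must be bypassed
    raise ValueError(f"unknown storage type: {storage!r}")
```

For example, `migration_safe("nfs", "writethrough", "raw")` is False, while `migration_safe("clustered", "writeback", "raw")` is True.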
* Re: [Qemu-devel] Storage requirements for live migration
From: Mark Wu @ 2011-11-11 6:27 UTC
To: Anthony Liguori
Cc: Kevin Wolf, Stefan Hajnoczi, Juan Quintela, qemu-devel,
Avi Kivity, Christoph Hellwig
On 11/11/2011 08:11 AM, Anthony Liguori wrote:
> I did a brain dump of my understanding of the various storage
> requirements for live migration. I think it's accurate but I may have
> misunderstood some details so I would appreciate review.
>
> I think given sections (1) and (2), the only viable thing is to
> require cache=none unless we get new interfaces to flush caches.
I have two questions:
1. Why is "cache=writethrough" not reliable? "O_DSYNC" guarantees that
data is written to disk before the operation completes.
Even if synchronization from memory to disk is still ongoing when
migration happens, we needn't care about the dirty pages in host memory
either, because the queued request will be re-executed on the destination
host. So I think it's reliable too. Am I missing anything?
2. Can we flush the cache for "cache=writeback" in the last phase of
migration? I think that would resolve the cache coherency problem, but
it could cause an unpredictable guest downtime.
Thanks.
Mark
* Re: [Qemu-devel] Storage requirements for live migration
From: Kevin Wolf @ 2011-11-11 9:15 UTC
To: Mark Wu
Cc: Stefan Hajnoczi, Juan Quintela, qemu-devel, Avi Kivity,
Christoph Hellwig
Am 11.11.2011 07:27, schrieb Mark Wu:
> On 11/11/2011 08:11 AM, Anthony Liguori wrote:
>> I did a brain dump of my understanding of the various storage
>> requirements for live migration. I think it's accurate but I may have
>> misunderstood some details so I would appreciate review.
>>
>> I think given sections (1) and (2), the only viable thing is to
>> require cache=none unless we get new interfaces to flush caches.
> I have two questions:
> 1. Why is "cache=writethrough" not reliable? "O_DSYNC" guarantees that
> data is written to disk before the operation completes.
> Even if synchronization from memory to disk is still ongoing when
> migration happens, we needn't care about the dirty pages in host memory
> either, because the queued request will be re-executed on the destination
> host. So I think it's reliable too. Am I missing anything?
The problem is about reading. The destination host reads some pieces of
information during its initialisation, which may later be changed by the
source host. After the migration has completed, the destination must not use
the old data that it read previously and that might be in its cache, but
it must access the storage. O_DSYNC doesn't help with that, as reading
still uses the cache there.
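To make the distinction concrete, here is how the cache= modes are commonly understood to map onto open(2) flags on the host. The table is an assumption based on general knowledge of QEMU rather than something stated in this thread, and `CACHE_MODES` and `reads_bypass_page_cache` are names invented for the sketch.

```python
# Host open(2) flags implied by each cache= mode (assumed mapping, see above).
CACHE_MODES = {
    "writeback":    {"O_DIRECT": False, "O_DSYNC": False},
    "writethrough": {"O_DIRECT": False, "O_DSYNC": True},
    "none":         {"O_DIRECT": True,  "O_DSYNC": False},
    "directsync":   {"O_DIRECT": True,  "O_DSYNC": True},
}


def reads_bypass_page_cache(mode):
    """Only O_DIRECT changes how reads behave; O_DSYNC affects writes only."""
    return CACHE_MODES[mode]["O_DIRECT"]


safe_for_reads = [m for m in CACHE_MODES if reads_bypass_page_cache(m)]
```

Under this mapping writethrough makes writes durable before they complete, but the destination can still be handed stale pages from its own page cache on read.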
Kevin
* Re: [Qemu-devel] Storage requirements for live migration
From: Kevin Wolf @ 2011-11-11 9:38 UTC
To: Anthony Liguori
Cc: Juan Quintela, Christoph Hellwig, qemu-devel, Stefan Hajnoczi,
Avi Kivity
Am 11.11.2011 01:11, schrieb Anthony Liguori:
> I did a brain dump of my understanding of the various storage requirements for
> live migration. I think it's accurate but I may have misunderstood some details
> so I would appreciate review.
>
> I think given sections (1) and (2), the only viable thing is to require
> cache=none unless we get new interfaces to flush caches.
Yes, I think we should strongly recommend cache=none/directsync, but not
enforce it. As you said, for clustered filesystems other options should
work, so we should allow users to choose to make use of that.
> Section (3) talks about image formats. As I mentioned elsewhere in the thread,
> I think the best we can do right now is have a block layer interface to quiesce
> the image format. I think reopen may be a viable short term strategy for qcow2
> but I think for raw, we should just make the quiesce operation a nop.
I don't agree with making it a nop on raw. You say it's safe without any
further action if you avoid online resize, but we don't disallow online
resize during migration, so this is an invalid assumption. And did you
review all the code to make sure that we don't cache more?
I believe it's better anyway to have all formats behave the same.
Kevin
* Re: [Qemu-devel] Storage requirements for live migration
From: Daniel P. Berrange @ 2011-11-11 9:55 UTC
To: Kevin Wolf
Cc: Stefan Hajnoczi, Juan Quintela, qemu-devel, Avi Kivity,
Christoph Hellwig
On Fri, Nov 11, 2011 at 10:38:20AM +0100, Kevin Wolf wrote:
> Am 11.11.2011 01:11, schrieb Anthony Liguori:
> > I did a brain dump of my understanding of the various storage requirements for
> > live migration. I think it's accurate but I may have misunderstood some details
> > so I would appreciate review.
> >
> > I think given sections (1) and (2), the only viable thing is to require
> > cache=none unless we get new interfaces to flush caches.
>
> Yes, I think we should strongly recommend cache=none/directsync, but not
> enforce it. As you said, for clustered filesystems other options should
> work, so we should allow users to choose to make use of that.
WRT libvirt, we have a concept of 'tainting' for guests. We set taint
flags whenever the management application requests a config, or performs
an action that we know to be potentially dangerous. These end up as log
messages in the per-guest logfile, so when users report bugs we can see
from the log that something "bad" has been done to the guest.
At the very least, it sounds like we should make libvirt mark guests as
tainted, if they have been migrated with cache != none, so this is easily
identifiable by BZ support people.
We might also want to make a libvirt host level config option to allow
host admins to forbid migration without cache=none.
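As a rough sketch of the policy being proposed (purely hypothetical, none of these names exist in libvirt): `check_migration` takes the cache mode of each guest disk plus an assumed host-level `forbid_unsafe` switch, and reports whether the migration may go ahead and whether the guest should be marked as tainted.

```python
def check_migration(disk_cache_modes, forbid_unsafe=False):
    """Return (allowed, tainted) for a guest, given the cache mode of each disk."""
    unsafe = [mode for mode in disk_cache_modes if mode != "none"]
    if not unsafe:
        return True, False  # every disk uses cache=none: nothing to flag
    if forbid_unsafe:
        return False, False  # the host admin has forbidden such migrations
    return True, True  # allowed, but leave a taint marker in the guest log
```

With the default setting a guest that has one writeback disk still migrates but is tainted, while with `forbid_unsafe=True` the same guest is refused.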
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
* Re: [Qemu-devel] Storage requirements for live migration
From: Kevin Wolf @ 2011-11-11 10:01 UTC
To: Daniel P. Berrange
Cc: Stefan Hajnoczi, Juan Quintela, qemu-devel, Avi Kivity,
Christoph Hellwig
Am 11.11.2011 10:55, schrieb Daniel P. Berrange:
> On Fri, Nov 11, 2011 at 10:38:20AM +0100, Kevin Wolf wrote:
>> Am 11.11.2011 01:11, schrieb Anthony Liguori:
>>> I did a brain dump of my understanding of the various storage requirements for
>>> live migration. I think it's accurate but I may have misunderstood some details
>>> so I would appreciate review.
>>>
>>> I think given sections (1) and (2), the only viable thing is to require
>>> cache=none unless we get new interfaces to flush caches.
>>
>> Yes, I think we should strongly recommend cache=none/directsync, but not
>> enforce it. As you said, for clustered filesystems other options should
>> work, so we should allow users to choose to make use of that.
>
> WRT libvirt, we have a concept of 'tainting' for guests. We set taint
> flags whenever the management application requests a config, or performs
> an action that we know to be potentially dangerous. These end up as log
> messages in the per-guest logfile, so when users report bugs we can see
> from the log that something "bad" has been done to the guest.
>
> At the very least, it sounds like we should make libvirt mark guests as
> tainted, if they have been migrated with cache != none, so this is easily
> identifiable by BZ support people.
>
> We might also want to make a libvirt host level config option to allow
> host admins to forbid migration without cache=none.
Yes, I think this would be a good approach.
Kevin
* Re: [Qemu-devel] Storage requirements for live migration
From: Anthony Liguori @ 2011-11-11 14:08 UTC
To: Daniel P. Berrange
Cc: Kevin Wolf, Stefan Hajnoczi, Juan Quintela, qemu-devel,
Avi Kivity, Christoph Hellwig
On 11/11/2011 03:55 AM, Daniel P. Berrange wrote:
> On Fri, Nov 11, 2011 at 10:38:20AM +0100, Kevin Wolf wrote:
>> Am 11.11.2011 01:11, schrieb Anthony Liguori:
>>> I did a brain dump of my understanding of the various storage requirements for
>>> live migration. I think it's accurate but I may have misunderstood some details
>>> so I would appreciate review.
>>>
>>> I think given sections (1) and (2), the only viable thing is to require
>>> cache=none unless we get new interfaces to flush caches.
>>
>> Yes, I think we should strongly recommend cache=none/directsync, but not
>> enforce it. As you said, for clustered filesystems other options should
>> work, so we should allow users to choose to make use of that.
>
> WRT libvirt, we have a concept of 'tainting' for guests. We set taint
> flags whenever the management application requests a config, or performs
> an action that we know to be potentially dangerous. These end up as log
> messages in the per-guest logfile, so when users report bugs we can see
> from the log that something "bad" has been done to the guest.
>
> At the very least, it sounds like we should make libvirt mark guests as
> tainted, if they have been migrated with cache != none, so this is easily
> identifiable by BZ support people.
>
> We might also want to make a libvirt host level config option to allow
> host admins to forbid migration without cache=none.
It might make more sense to make it a property of the storage pool. That is, a
storage pool should have a notion of whether it supports migration and what
constraints (if any) are given on its migration support.
Regards,
Anthony Liguori
>
> Daniel
* Re: [Qemu-devel] Storage requirements for live migration
From: Anthony Liguori @ 2011-11-11 14:05 UTC
To: Kevin Wolf
Cc: qemu-devel, Avi Kivity, Christoph Hellwig, Stefan Hajnoczi,
Juan Quintela
On 11/11/2011 03:38 AM, Kevin Wolf wrote:
> Am 11.11.2011 01:11, schrieb Anthony Liguori:
>> I did a brain dump of my understanding of the various storage requirements for
>> live migration. I think it's accurate but I may have misunderstood some details
>> so I would appreciate review.
>>
>> I think given sections (1) and (2), the only viable thing is to require
>> cache=none unless we get new interfaces to flush caches.
>
> Yes, I think we should strongly recommend cache=none/directsync, but not
> enforce it. As you said, for clustered filesystems other options should
> work, so we should allow users to choose to make use of that.
>
>> Section (3) talks about image formats. As I mentioned elsewhere in the thread,
>> I think the best we can do right now is have a block layer interface to quiesce
>> the image format. I think reopen may be a viable short term strategy for qcow2
>> but I think for raw, we should just make the quiesce operation a nop.
>
> I don't agree with making it a nop on raw. You say it's safe without any
> further action if you avoid online resize, but we don't disallow online
> resize during migration, so this is an invalid assumption. And did you
> review all the code to make sure that we don't cache more?
It's not just that. We cache information outside of the block layer (like the
CHS). This means that reopening isn't even enough because that only drops the
cached data in the block layer.
The file size is also cached by devices.
Regards,
Anthony Liguori
>
> I believe it's better anyway to have all formats behave the same.
>
> Kevin
>
* Re: [Qemu-devel] Storage requirements for live migration
From: Ryan Harper @ 2011-11-11 22:43 UTC
To: Anthony Liguori
Cc: Kevin Wolf, Stefan Hajnoczi, Juan Quintela, qemu-devel,
Avi Kivity, Christoph Hellwig
> == Image Formats ==
>
> Image formats are not safe to use with live migration. The reason is that
> QEMU caches data for image formats and does not have a mechanism to flush
> those caches. The following attempts to describe the issues with the
> various formats
>
> === QCOW2 ===
>
> QCOW2 caches two forms of data, cluster metadata (L1/L2 data, refcount
> table, etc) and mutable header information (file size, snapshot entries,
> etc).
>
> This data needs to be discarded before after migration starts.
before after?
pick one =)
--
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
ryanh@us.ibm.com
* Re: [Qemu-devel] Storage requirements for live migration
From: Anthony Liguori @ 2011-11-11 23:23 UTC
To: Ryan Harper
Cc: Kevin Wolf, Stefan Hajnoczi, Juan Quintela, qemu-devel,
Avi Kivity, Christoph Hellwig
On 11/11/2011 04:43 PM, Ryan Harper wrote:
>> == Image Formats ==
>>
>> Image formats are not safe to use with live migration. The reason is that
>> QEMU caches data for image formats and does not have a mechanism to flush
>> those caches. The following attempts to describe the issues with the
>> various formats
>>
>> === QCOW2 ===
>>
>> QCOW2 caches two forms of data, cluster metadata (L1/L2 data, refcount
>> table, etc) and mutable header information (file size, snapshot entries,
>> etc).
>>
>> This data needs to be discarded before after migration starts.
>
> before after?
>
> pick one =)
Er, the sentence is intelligible with either word. I've updated the wiki with
something that actually makes sense :-)
Regards,
Anthony Liguori
>
>
Thread overview:
2011-11-11 0:11 [Qemu-devel] Storage requirements for live migration Anthony Liguori
2011-11-11 6:27 ` Mark Wu
2011-11-11 9:15 ` Kevin Wolf
2011-11-11 9:38 ` Kevin Wolf
2011-11-11 9:55 ` Daniel P. Berrange
2011-11-11 10:01 ` Kevin Wolf
2011-11-11 14:08 ` Anthony Liguori
2011-11-11 14:05 ` Anthony Liguori
2011-11-11 22:43 ` Ryan Harper
2011-11-11 23:23 ` Anthony Liguori