[Qemu-devel] Storage requirements for live migration


From: Anthony Liguori <anthony@codemonkey.ws>
To: qemu-devel <qemu-devel@nongnu.org>
Cc: Kevin Wolf <kwolf@redhat.com>,
	Juan Quintela <quintela@redhat.com>,
	Christoph Hellwig <hch@lst.de>, Avi Kivity <avi@redhat.com>,
	Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Subject: [Qemu-devel] Storage requirements for live migration
Date: Thu, 10 Nov 2011 18:11:40 -0600
Message-ID: <4EBC683C.7090700@codemonkey.ws>

I did a brain dump of my understanding of the various storage requirements for 
live migration.  I think it's accurate but I may have misunderstood some details 
so I would appreciate review.

I think given sections (1) and (2), the only viable thing is to require 
cache=none unless we get new interfaces to flush caches.

Section (3) talks about image formats.  As I mentioned elsewhere in the thread, 
I think the best we can do right now is have a block layer interface to quiesce 
the image format.  I think reopen may be a viable short term strategy for qcow2 
but I think for raw, we should just make the quiesce operation a nop.

http://wiki.qemu.org/Migration/Storage

Inlined below for ease of review.

Regards,

Anthony Liguori

Migration in QEMU is designed assuming cache coherent shared storage and raw 
format block devices.  There are some cases where migration will also work 
with more weakly coherent shared storage.  This wiki page attempts to outline 
those scenarios.  It also attempts to iterate through the reasons why various 
image formats do not support migration even with shared storage.

== NFS ==

=== Background ===

NFS only offers close-to-open cache coherence.  This means that the only 
guarantee provided by the protocol is that if you close a file in a client A and 
then open the file in another client B, client B will see client A's changes.
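
To make the close-to-open guarantee concrete, here is a toy model in Python.  It is 
only a sketch: the Server and Client classes are hypothetical and do not speak 
the NFS protocol.  The model is deliberately pessimistic.  Real clients often 
see changes sooner, but the protocol does not promise it.

```python
class Server:
    """Holds the authoritative copy of each file."""

    def __init__(self):
        self.files = {}


class Client:
    """Toy client that caches a whole file between open and close."""

    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.dirty = set()

    def open(self, name):
        # Open revalidates: fetch the server's current contents.
        self.cache[name] = self.server.files.get(name, b"")

    def read(self, name):
        # Reads are served from the local cache only.
        return self.cache[name]

    def write(self, name, data):
        # Writes stay in the local cache until close.
        self.cache[name] = data
        self.dirty.add(name)

    def close(self, name):
        # Close flushes modified contents to the server.
        data = self.cache.pop(name)
        if name in self.dirty:
            self.dirty.discard(name)
            self.server.files[name] = data


def demo():
    server = Server()
    a, b = Client(server), Client(server)
    a.open("disk.img")
    b.open("disk.img")
    a.write("disk.img", b"new data")
    before_close = b.read("disk.img")   # A has not closed yet
    a.close("disk.img")
    after_close = b.read("disk.img")    # B has not reopened yet
    b.close("disk.img")
    b.open("disk.img")
    after_reopen = b.read("disk.img")   # close on A, then open on B
    return before_close, after_close, after_reopen
```

In this model only the last read is guaranteed to see client A's write, which is 
why a destination that already holds the image open cannot rely on the protocol 
alone.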

The way migration works in QEMU, the source stops the guest after it sends all 
of the required data but does not immediately free any resources.  This makes 
migration more reliable since it avoids the Two Generals Problem by allowing a 
reliable third node to make the final decision about whether migration was 
successful.

As soon as the destination receives all of the data, it immediately starts the 
guest.  This means that the reliable third node is not in the critical path of 
migration downtime but can still recover a failed migration.

Since the source never knows that the destination is okay, the only way to 
support NFS robustly would be to close all files on the source before sending 
the last chunk of migration data.  This would mean that if any failure occurred 
after this point, the VM would be lost.

=== In Practice ===

A Linux NFS server that exports with 'sync' offers stronger coherency than NFS 
guarantees.  This is an implementation detail, not a guarantee, as far as I know.  
If a client sends a read request, any data that the server has acknowledged as 
a stable write from any other client will be returned without the need 
to close and reopen the file.

A file opened with O_DIRECT with the Linux NFS client code will always issue a 
protocol read operation given a userspace read() call.  This means that if you 
issue stable writes (fsync) on the source and then use O_DIRECT to read on the 
destination, you can safely access the same file without reopening.
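
The access pattern described above can be sketched in Python on Linux as follows.  This 
is an illustration, not what QEMU does internally: write_stable() and 
read_direct() are hypothetical helper names, and the 4096 byte block size is an 
assumption (O_DIRECT alignment requirements depend on the filesystem and 
device).

```python
import mmap
import os


def write_stable(path, offset, data):
    """Source side: write, then fsync so the data is stable on the server."""
    fd = os.open(path, os.O_WRONLY)
    try:
        # Single pwrite; a robust version would loop on short writes.
        os.pwrite(fd, data, offset)
        os.fsync(fd)
    finally:
        os.close(fd)


def read_direct(path, offset, length, block_size=4096):
    """Destination side: read with O_DIRECT, bypassing the page cache."""
    if offset % block_size or length % block_size:
        raise ValueError("offset and length must be multiples of block_size")
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    try:
        # Anonymous mmap memory is page-aligned, which O_DIRECT needs.
        buf = mmap.mmap(-1, length)
        try:
            n = os.preadv(fd, [buf], offset)
            return buf[:n]
        finally:
            buf.close()
    finally:
        os.close(fd)
```

With cache=none QEMU opens the image with O_DIRECT itself, so this is only meant 
to show which system calls the argument above relies on.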

=== Conclusion ===

Migration with QEMU is safe, in practice, when using Linux as an NFS server and 
client when both the source and destination are using cache=none for the disks 
and a raw file.

== iSCSI/Direct Attached Storage ==

iSCSI has a similar cache coherency guarantee to direct attached storage (via 
fibre channel).  Any read request will return data that has been acknowledged as 
written by another client.

Since QEMU issues read() requests in userspace, Linux normally uses the page 
cache.  The Linux page cache is not coherent across multiple nodes so the only 
way to safely access storage coherently is to bypass the Linux page cache via 
cache=none.

=== Conclusion ===

iSCSI, FC, or other forms of direct attached storage are only safe to use with 
live migration if you use cache=none and a raw image.

== Clustered File Systems ==

Clustered File Systems such as GPFS, Ceph, Glusterfs, or GFS2 are safe to use 
with live migration regardless of the caching option used as long as raw images 
are used.

== Image Formats ==

Image formats are not safe to use with live migration.  The reason is that QEMU 
caches data for image formats and does not have a mechanism to flush those 
caches.  The following attempts to describe the issues with the various formats.

=== QCOW2 ===

QCOW2 caches two forms of data, cluster metadata (L1/L2 data, refcount table, 
etc) and mutable header information (file size, snapshot entries, etc).

This cached data needs to be discarded on the destination after migration 
completes but before the guest starts running there.

=== QED ===

QED caches similar data to QCOW2.  In addition, the QED header has a dirty flag 
that must be handled specially in the case of live migration.

=== Raw Files ===

Technically, the file size of a raw file is mutable metadata that QEMU caches. 
This is only applicable when using online image resizing.  If you avoid online 
image resizing during live migration, raw files are completely safe provided the 
storage used meets the above requirements.
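
The conclusions on this page can be condensed into a small lookup, sketched here in 
Python.  The function name and the storage labels ("nfs", "iscsi", "fc", 
"clustered") are made up for illustration and are not QEMU options.  The "nfs" 
case assumes Linux as both NFS server and client, and the raw case assumes no 
online resize during migration.

```python
# Storage types that are only safe when the page cache is bypassed.
CACHE_NONE_REQUIRED = {"nfs", "iscsi", "fc"}
# Storage types that are coherent regardless of the cache mode.
ANY_CACHE_MODE = {"clustered"}


def migration_safe(storage, cache, image_format):
    """Return True if this page describes the combination as safe."""
    if image_format != "raw":
        # qcow2, qed: QEMU caches metadata that it cannot flush.
        return False
    if storage in ANY_CACHE_MODE:
        return True
    if storage in CACHE_NONE_REQUIRED:
        return cache == "none"
    # Unknown storage type: assume unsafe.
    return False
```

For example, migration_safe("nfs", "none", "raw") is True, while 
migration_safe("iscsi", "writeback", "raw") and 
migration_safe("clustered", "none", "qcow2") are both False.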

