From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([140.186.70.92]:35915)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <dlaor@redhat.com>) id 1Qe6Rq-0008I6-Db
	for qemu-devel@nongnu.org; Tue, 05 Jul 2011 10:18:00 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <dlaor@redhat.com>) id 1Qe6Rn-00055V-Gn
	for qemu-devel@nongnu.org; Tue, 05 Jul 2011 10:17:58 -0400
Received: from mx1.redhat.com ([209.132.183.28]:63512)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <dlaor@redhat.com>) id 1Qe6Rm-000554-MJ
	for qemu-devel@nongnu.org; Tue, 05 Jul 2011 10:17:55 -0400
Message-ID: <4E131D0D.307@redhat.com>
Date: Tue, 05 Jul 2011 17:17:49 +0300
From: Dor Laor <dlaor@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: [Qemu-devel] live block copy/stream/snapshot discussion
Reply-To: dlaor@redhat.com
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: qemu-devel <qemu-devel@nongnu.org>, Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>, Marcelo Tosatti <mtosatti@redhat.com>, Kevin Wolf <kwolf@redhat.com>, Avi Kivity <avi@redhat.com>, Anthony Liguori <aliguori@us.ibm.com>

Anthony advised cloning
http://wiki.qemu.org/index.php?title=Features/LiveBlockMigrationFuture 
to the list in order to encourage discussion, so here it is:
------------------------------------------------------------------------
  qemu is expected to support these features (some already implemented):

= Live features =

== Live block copy ==

    Ability to copy one or more virtual disks from the source backing
    file/block device to a new target that is accessible by the host.
    The copy is supposed to run transparently while the VM is running.

== Live snapshots and live snapshot merge ==

    Live snapshot is already incorporated (by Jes) in qemu (it still
    needs virt-agent work to freeze the guest FS).
    Live snapshot merge is required in order to reduce the overhead
    caused by the additional snapshots (sometimes over a raw device).
    We'll use live copy to do the live merge.

== Image streaming (Copy on read) ==
    Ability to start guest execution while the parent image resides
    remotely and each block that is accessed is replicated to a local
    copy (an image format snapshot).
    Such functionality can be hooked together with live block migration
    instead of the 'post copy' method.
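
    A minimal sketch of the copy-on-read idea, assuming a toy in-memory
    model where an image is just a list of blocks. The class and method
    names are hypothetical and are not qemu code; the sketch only shows
    that a block is fetched from the parent once and served locally
    afterwards, and that a background step can pull in the rest.

```python
class CopyOnReadImage:
    """Toy model: a local image that fills itself from a remote parent."""

    def __init__(self, base_blocks):
        self.base = base_blocks   # stands in for the remote parent image
        self.local = {}           # block index -> data held locally
        self.remote_reads = 0     # how often the parent had to be read

    def read(self, index):
        if index not in self.local:
            # First access: fetch from the parent and keep a local copy.
            self.local[index] = self.base[index]
            self.remote_reads += 1
        return self.local[index]

    def write(self, index, data):
        # Guest writes only ever touch the local image.
        self.local[index] = data

    def stream_step(self):
        """Copy one block that is not local yet; False when none is left."""
        for index in range(len(self.base)):
            if index not in self.local:
                self.local[index] = self.base[index]
                self.remote_reads += 1
                return True
        return False

    def is_standalone(self):
        # Every block is local, so the parent is no longer needed.
        return len(self.local) == len(self.base)
```

    Calling stream_step() in a loop until it returns False models the
    background streaming task; guest reads and writes can be interleaved
    with it at any point.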

== Live block migration (pre/post) ==

    Beyond live block copy we'll sometimes need to move both the storage
    and the guest. There are two main approaches here:
    - pre copy
      First live copy the image and only then live migrate the VM.
      It is the simpler and safer approach in terms of the management
      app, but if the purpose of the whole live block migration was to
      balance the cpu load, it won't be practical to use since copying
      an image of 100GB will take too long.
    - post copy (streaming / copy on read)
      First live migrate the VM, then stream its blocks online.
      It's a better approach for HA/load balancing but it might make
      management complex (need to keep the source VM alive, handling
      failures).
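
    To put a rough number on the pre copy concern, here is a
    back-of-the-envelope sketch. The 100 MB/s sustained throughput is an
    assumption picked only for illustration, and the estimate ignores
    blocks that the guest dirties during the copy and that have to be
    sent again.

```python
def copy_seconds(image_bytes, bytes_per_second):
    """Lower bound on the time needed to copy an image once."""
    if bytes_per_second <= 0:
        raise ValueError("throughput must be positive")
    return image_bytes / bytes_per_second


GB = 10**9
MB = 10**6

# Assumed throughput of 100 MB/s for the 100GB image mentioned above.
seconds = copy_seconds(100 * GB, 100 * MB)
minutes = seconds / 60
```

    Under that assumption the VM stays on the loaded host for well over
    a quarter of an hour before the migration itself can even start.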

    In addition there are two cases for the storage access:

    1. Shared storage
       Live block copy enables this capability; it seems like a rare
       case for live block migration.
    2. No shared storage
       There are some cases where there is no NFS/SAN storage and live
       migration is needed. It should be similar to VMware's Storage
       VMotion.
       http://www.vmware.com/files/pdf/VMware-Storage-VMotion-DS-EN.pdf
       http://www.vmware.com/products/storage-vmotion/features.html

== Using external dirty block bitmap ==

    FVD has an option to use an external dirty block bitmap file in
    addition to the regular mapping/data files.
    We can consider using it for live block migration and live merge too.
    It can also allow 3rd party tools to calculate diffs between the
    snapshots.
    There is a big downside though, since it will make management
    complicated and there is a risk of the image and its bitmap file
    getting out of sync. It's a much better choice to have the qemu-img
    tool be the single interface to the dirty block bitmap data.
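
    For readers unfamiliar with the idea, a dirty block bitmap can be
    sketched as one bit per block. This is an illustrative in-memory
    model with hypothetical names; it is not the FVD on-disk format and
    not a qemu or qemu-img interface.

```python
class DirtyBitmap:
    """One bit per block; a set bit means the block was written."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.bits = bytearray((num_blocks + 7) // 8)

    def _check(self, block):
        if not 0 <= block < self.num_blocks:
            raise IndexError("block out of range")

    def set(self, block):
        self._check(block)
        self.bits[block // 8] |= 1 << (block % 8)

    def clear(self, block):
        self._check(block)
        self.bits[block // 8] &= ~(1 << (block % 8)) & 0xFF

    def is_dirty(self, block):
        self._check(block)
        return bool(self.bits[block // 8] & (1 << (block % 8)))

    def dirty_blocks(self):
        # The list a copy or merge job, or a diff tool, would walk.
        return [b for b in range(self.num_blocks) if self.is_dirty(b)]
```

    The out-of-sync risk mentioned above corresponds to the bytes in
    `bits` being persisted separately from the image data they describe.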

= Solutions =

== Non shared storage ==

    Either use iSCSI (target and initiator), NBD, or a proprietary qemu
    solution. iSCSI in theory is the best, but there is a problem with
    COW images: iSCSI cannot report the COW level or detect unallocated
    blocks. This might force us to use a proprietary solution.
    An interesting option (by Orit Wasserman) was to use iSCSI for
    exporting the images outside of qemu, and qemu will access them as
    if they were a local device. This can work with almost no effort.
    What do we do with chains of COW files? We create up to N such
    iSCSI connections, one for every COW file in the chain.

== Live block migration ==

    Use the streaming approach + regular live migration + iSCSI:
    Execute regular live migration and at the end of it, start streaming.
    If there is no shared storage, use the external iSCSI and behave as
    if the image is local. At the end of the streaming operation there
    will be a new local base image.

== Block mirror layer ==

    Was invented in order to duplicate write IOs to both the source and
    destination images. It prevents the potential race when both qemu
    and the management crash at the end of the block copy stage and it
    is unknown whether management should pick the source or the
    destination.
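
    A rough sketch of the mirroring idea, assuming a toy model in which
    images are lists of blocks and a background cursor copies them one
    at a time. The names are hypothetical and this is not the qemu
    mirror driver; it only shows why duplicating guest writes leaves the
    two images identical once the background copy has finished.

```python
class MirrorJob:
    """Toy model: background copy plus duplicated guest writes."""

    def __init__(self, source_blocks):
        self.source = list(source_blocks)
        self.destination = [None] * len(self.source)
        self.next_block = 0  # background copy cursor

    def guest_write(self, index, data):
        # Every guest write goes to both images.
        self.source[index] = data
        self.destination[index] = data

    def copy_step(self):
        """Copy one block in the background; False when the copy is done."""
        if self.next_block >= len(self.source):
            return False
        i = self.next_block
        self.destination[i] = self.source[i]
        self.next_block += 1
        return True

    def mirrored(self):
        # Copy finished and both sides hold the same data.
        return (self.next_block >= len(self.source)
                and self.source == self.destination)
```

    Once mirrored() holds, either image is a valid choice, so a crash at
    that point no longer leaves management guessing.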

== Streaming ==

    No need for mirror since only the destination changes and is
    writable.

== Block copy background task ==

    Can be shared between block copy and streaming

== Live snapshot ==

    It can be seen as a (local) stream that preserves the current COW
    chain.

= Use cases =

  1. Basic streaming, single base master image on source storage, needs
     to be instantiated on destination storage

      The base image is a single level COW format (file or lvm).
      The base is RO and only the new destination, base', is RW. base'
      is empty at the beginning. The base image content is copied in
      the background to base'. At the end of the operation, base' is a
      standalone image that does not depend on the base image.

      a. Case of a shared storage streaming guest boot

      Before:           src storage: base             dst storage: none
      After:            src storage: base             dst storage: base'

      b. Case of no shared storage streaming guest boot
         Everything is the same, except that we use an external iSCSI
         target on the src host and an external iSCSI initiator on the
         destination host. Qemu boots from the destination by using the
         iSCSI access. This is transparent to qemu (except for a cmd
         syntax change). Once the streaming is over, we can live drop
         the usage of iSCSI and open the image directly (some sort of
         null live copy).

      c. Live block migration (using streaming) w/ shared storage.
         Exactly like 1.a. First create the destination image, then we
         run live migration there w/o data in the new image. Now we
         stream like the boot scenario.

      d. Live block migration (using streaming) w/o shared storage.
         Like 1.b. + 1.c.

      *** There is complexity in handling multiple block devices
      belonging to the same VM. Management will need to track each
      stream finish event and manage failures accordingly.

  2. Basic streaming of raw files/devices

     Here we have an issue - what happens if there is a failure in the
     middle? Regular COW can sustain a failure since the intermediate
     base' contains dirty block information. An intermediate raw base'
     image has no such information and will be broken. We cannot revert
     to the original base and start over because new writes were
     written only to base'.

     Approaches:
     a. Don't support that
     b. Use an intermediate COW image and then live copy it into raw
        (wastes time, IO, space). One can easily add a new COW over the
        source and continue from there.
     c. Use external metadata of dirty-block-bitmap even for raw

     Suggestion: at this stage, use either approach #a or #b.


  3. Basic live copy, single base master image on source storage, needs
     to be copied to the destination storage

     The base image is a single level COW format or a raw file/device.
     The base image content is copied in the background to base'.
     At the end of the operation, base' is a standalone image that does
     not depend on the base image. In this case we only take into
     account a running VM; there is no need to do that for the boot
     stage. So it is either a VM running locally and about to change
     its storage or a VM live migration. The plan is to use the mirror
     driver approach. Both src/dst are writable.

      a. Case of a shared storage, a VM changes its block device

      Before:           src storage: base             dst storage: none
      After:            src storage: base             dst storage: base'

      This is a plain live copy w/o moving the VM.
      The case w/o shared storage seems not relevant here.
      We might want to move multiple block devices of the VM.
      It is written here for completeness - it shouldn't change anything.
      Still management/events will use the block name/id.

      b. Live block migration (w/o streaming) w/ shared storage.
         Unlike in the streaming case, the order here is reversed:
         Run live copy. When it ends and we're in the mirror state, run
         live migration. When it ends, stop the mirroring and make the
         VM continue on the destination.
         That's probably a rare use case.

      c. Live block migration (w/o streaming) w/o shared storage.
         Like 3.b. by using external iSCSI


  4. COW chains that preserve the full structure

     Before:  src: base <- sn1 <- snx    dst: none
     After:   src: base <- sn1 <- snx    dst: base' <- sn1' <- snx'

     All of the original snapshot chains should be copied or streamed
     as is to the new storage. With copying we can do all of the non
     leaf images using standard 'cp tools'.
     If we're to use iSCSI, we'll need to create N such connections.
     This is probably not a common use case for streaming; we might
     ignore it there and use this scenario only for copying.

  5. Like 4. but the chain can collapse. In fact this is like a special
     case of #4

      Before: src: base <- sn1 <- sn2 .. <- snx  dst: none
      After:  src: base <- sn1 <- sn2 .. <- snx  dst: base' <- sn1' .. <- sny'

     There is no difference from #4 other than collapsing some chain
     path into the dst leaf.
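
     The collapse can be sketched with a toy model in which every image
     in the chain is a dict holding only the blocks allocated in that
     image, ordered from base to leaf. The helper names are hypothetical
     and this is not how qemu stores COW images; it only shows that
     merging a tail of the chain into one leaf keeps what the guest
     reads unchanged.

```python
def read_block(chain, index):
    """Resolve a block by walking from the leaf back towards the base."""
    for image in reversed(chain):
        if index in image:
            return image[index]
    return None  # unallocated everywhere in the chain


def collapse(chain, keep):
    """Keep the first `keep` images and merge the rest into one leaf."""
    if not 0 <= keep < len(chain):
        raise ValueError("keep must leave at least one image to merge")
    merged = {}
    for image in chain[keep:]:
        # Later images override earlier ones, as in a COW chain.
        merged.update(image)
    return [dict(image) for image in chain[:keep]] + [merged]


base = {0: "b0", 1: "b1", 2: "b2"}
sn1 = {1: "s1"}
sn2 = {2: "t2"}
snx = {1: "x1"}
src_chain = [base, sn1, sn2, snx]
dst_chain = collapse(src_chain, keep=2)  # base' <- sn1' <- sny'
```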

  6. Live snapshot

     It's here since the interface can be similar. Basically it is
     similar to live copy but instead of copying, we switch to another
     COW on top. The only (separate) addition would be to add a verb to
     ask the guest to flush its file systems.

     Before:           storage: base <- s1 <- sx
     After:            storage: base <- s1 <- sx <- sx+1

== Exceptions ==

  1. Hot unplug of the relevant disk
     Prevent that (or cancel the operation).

  2. Live migration in the middle of a non-migration action from above
     Shall we allow it? It can work, but at the end of live migration we
     need to reopen the images (NFS mainly), which might add unneeded
     complexity.
     We had better prevent that.


= Interface =

== Streaming (by Stefan) ==

   1. Start a background streaming operation:
     (qemu) block_stream -a ide0-hd

   2. Check the status of the operation:
     (qemu) info block-stream
     Streaming device ide0-hd: Completed 512 of 34359738368 bytes

   3. The status changes when the operation completes:
     (qemu) info block-stream
     No active stream

   On completion the image file no longer has a backing file dependency.
   When streaming completes QEMU updates the image file metadata to
   indicate that no backing file is used.
   The QMP interface is similar but provides QMP events to signal
   streaming completion and failure.  Polling to query the streaming
   status is only used when the management application wishes to refresh
   progress information.
   If guest execution is interrupted by a power failure or QEMU crash,
   then the image file is still valid but streaming may be incomplete.
   When QEMU is launched again the block_stream command can be issued to
   resume streaming.
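
   For a management application that does want to refresh progress
   information, the two status lines shown above could be parsed along
   these lines. This is only a sketch that assumes the monitor output
   has exactly the form quoted in this mail; the function name is
   hypothetical and the real QMP interface may report the same data
   differently.

```python
import re

# Matches the progress line quoted above, e.g.
# "Streaming device ide0-hd: Completed 512 of 34359738368 bytes"
STATUS_RE = re.compile(
    r"^Streaming device (?P<device>\S+): "
    r"Completed (?P<done>\d+) of (?P<total>\d+) bytes$"
)


def parse_stream_status(line):
    """Return a dict describing the progress, or None if no stream runs."""
    line = line.strip()
    if line == "No active stream":
        return None
    match = STATUS_RE.match(line)
    if match is None:
        raise ValueError("unrecognised status line: %r" % line)
    done = int(match.group("done"))
    total = int(match.group("total"))
    return {
        "device": match.group("device"),
        "done": done,
        "total": total,
        "fraction": done / total if total else 0.0,
    }
```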

-----------------
Cheers,
Dor