From: Anthony Liguori <aliguori@linux.vnet.ibm.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: "libvir-list@redhat.com" <libvir-list@redhat.com>,
	qemu-devel <qemu-devel@nongnu.org>,
	Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Subject: Re: [Qemu-devel] QEMU interfaces for image streaming and post-copy block migration
Date: Tue, 07 Sep 2010 09:49:50 -0500
Message-ID: <4C86510E.9010303@linux.vnet.ibm.com>
In-Reply-To: <4C864D65.6090004@redhat.com>

On 09/07/2010 09:34 AM, Kevin Wolf wrote:
> On 07.09.2010 15:41, Anthony Liguori wrote:
>    
>> Hi,
>>
>> We've got copy-on-read and image streaming working in QED and before
>> going much further, I wanted to bounce some interfaces off of the
>> libvirt folks to make sure our final interface makes sense.
>>
>> Here's the basic idea:
>>
>> Today, you can create images based on base images that are copy on
>> write.  With QED, we also support copy on read which forces a copy from
>> the backing image on read requests and write requests.
>>
>> In addition to copy-on-read, we introduce a notion of streaming a
>> block device which means that we search for an unallocated region of the
>> leaf image and force a copy-on-read operation.
>>
>> The combination of copy-on-read and streaming means that you can start a
>> guest based on slow storage (like over the network) and bring in blocks
>> on demand while also having a deterministic mechanism to complete the
>> transfer.
>>
>> The interface for copy-on-read is just an option within qemu-img
>> create.
>>      
> Shouldn't it be a runtime option? You can use the very same image with
> copy-on-read or copy-on-write and it will behave the same (except for
> performance), so it's not an inherent feature of the image file.
>    

The way it's implemented in QED is that it's a compatible feature.  This 
means that implementations are allowed to ignore it if they want to.  
It's really a suggestion.

So yes, you could have a run time switch that overrides the feature bit 
on disk and either forces copy-on-read on or off.
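
A minimal sketch of that override logic, assuming a hypothetical bit
position for the copy-on-read hint (the actual QED header layout is not
spelled out in this message) and a hypothetical tri-state runtime option:

```python
from typing import Optional

# Hypothetical bit position, for illustration only.
COMPAT_FEATURE_COPY_ON_READ = 1 << 0


def copy_on_read_enabled(compat_features: int,
                         override: Optional[bool] = None) -> bool:
    """Return the effective copy-on-read setting.

    override=None  -> follow the on-disk hint
    override=True  -> force copy-on-read on
    override=False -> force copy-on-read off
    """
    if override is not None:
        return override
    return bool(compat_features & COMPAT_FEATURE_COPY_ON_READ)
```

Because the bit is only a hint, an implementation that never looks at it
still reads the image correctly; it just never populates the leaf on read.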

Do we have a way to pass block drivers run time options?

> Doing it this way has the additional advantage that you need no image
> format support for this, so we could implement copy-on-read for other
> formats, too.
>    

To do it efficiently, it really needs to be in the format for the same 
reason that copy-on-write is part of the format.

You need to understand the cluster boundaries in order to optimize the 
metadata updates.  Sure, you can expose interfaces to the block layer to 
give all of this info but that's solving the same problem for doing 
block level copy-on-write.

The other challenge is that for copy-on-read to be efficient, you 
really need a format that can distinguish between unallocated sectors 
and zero sectors and do zero detection during the copy-on-read 
operation.  Otherwise, if you have a 10G virtual disk with a backing 
file that's 1GB in size, copy-on-read will result in the leaf being 10G 
instead of ~1GB.
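
To make the arithmetic concrete, here is a rough model of the leaf size
after every cluster has been copied on read.  The 64K cluster size and the
function name are assumptions for illustration, and metadata overhead is
ignored:

```python
CLUSTER_SIZE = 64 * 1024  # assumed cluster size, for illustration only


def leaf_size_after_full_copy(virtual_clusters, backing_data_clusters,
                              zero_detection):
    """Bytes of data allocated in the leaf once every cluster was read.

    virtual_clusters:      number of clusters in the virtual disk
    backing_data_clusters: set of cluster indexes that hold non-zero data
                           in the backing file
    zero_detection:        whether the format can mark a cluster as
                           "known zero" without writing a data cluster
    """
    allocated = 0
    for index in range(virtual_clusters):
        is_zero = index not in backing_data_clusters
        if is_zero and zero_detection:
            # Only metadata is updated; no data cluster is written.
            continue
        allocated += CLUSTER_SIZE
    return allocated


GIB = 2 ** 30
virtual = 10 * GIB // CLUSTER_SIZE          # 10G virtual disk
data = set(range(1 * GIB // CLUSTER_SIZE))  # 1GB of real data in backing file

without = leaf_size_after_full_copy(virtual, data, zero_detection=False)
with_zd = leaf_size_after_full_copy(virtual, data, zero_detection=True)
print(without // GIB, with_zd // GIB)
```

Without zero detection every cluster is written out, so the leaf grows to
the full virtual size; with it, only the clusters that carry data are
allocated.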

>> Streaming, on the other hand, requires a bit more thought.
>> Today, I have a monitor command that does the following:
>>
>> stream <device> <sector offset>
>>
>> Which will try to stream the minimal amount of data for a single I/O
>> operation and then return how many sectors were successfully streamed.
>>
>> The idea about how to drive this interface is a loop like:
>>
>> offset = 0;
>> while offset < image_size:
>>      wait_for_idle_time()
>>      count = stream(device, offset)
>>      offset += count
>>
>> Obviously, the "wait_for_idle_time()" requires wide system awareness.
>> The thing I'm not sure about is whether libvirt would 1) want to expose a
>> similar stream interface and let management software determine idle time, or
>> 2) attempt to detect idle time on its own and provide a higher level
>> interface.  If (2), the question then becomes whether we should try to
>> do this within qemu and provide libvirt a higher level interface.
>>      
> I think libvirt shouldn't have to care about sector offsets. You should
> just tell qemu to fetch the image and it should do so. We could have
> something like -drive backing_mode=[cow|cor|stream].
>    

This interface lets libvirt decide when the I/O system is idle.  The 
sector is really just a token to keep track of our overall progress.

One thing I envisioned was that a tool like virt-manager could have a 
progress bar showing the streaming progress.  It could update the 
progress bar based on (offset * 512) / image_size.
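
A sketch of how a management tool might drive this, assuming image_size is
in bytes (as the formula above implies) and that stream() and
wait_for_idle_time() are callables supplied by the caller.  Both are
hypothetical stand-ins, not an existing libvirt or QEMU API:

```python
SECTOR_SIZE = 512


def stream_image(stream, wait_for_idle_time, image_size, report=None):
    """Drive a 'stream <device> <sector offset>'-style command to the end.

    stream(offset)       -> number of sectors streamed starting at `offset`
    wait_for_idle_time() -> blocks until the caller considers I/O idle
    image_size           -> size of the image in bytes
    report(fraction)     -> optional progress callback, 0.0 to 1.0
    """
    total_sectors = (image_size + SECTOR_SIZE - 1) // SECTOR_SIZE
    offset = 0  # sector offset; doubles as the progress token
    while offset < total_sectors:
        wait_for_idle_time()
        count = stream(offset)
        if count <= 0:
            # Guard against spinning forever if nothing was streamed.
            raise RuntimeError(f"stream made no progress at sector {offset}")
        offset += count
        if report is not None:
            report(min(1.0, offset * SECTOR_SIZE / image_size))
    return offset
```

The caller keeps no state other than the sector offset, which is why it
works as a resumable token.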

If libvirt isn't driving it, we need to detect idle I/O time and we need 
to provide an interface to query status.  Not a huge problem but I'm not 
sure that a single QEMU instance can properly detect idle I/O time.

Regards,

Anthony Liguori

>> A related topic is block migration.  Today we support pre-copy migration
>> which means we transfer the block device and then do a live migration.
>> Another approach is to do the live migration first while running a block
>> server on the source, then use image streaming on the destination to move
>> the device.
>>
>> With QED, to implement this one would:
>>
>> 1) launch qemu-nbd on the source while the guest is running
>> 2) create a qed file on the destination with copy-on-read enabled and a
>> backing file using nbd: to point to the source qemu-nbd
>> 3) run qemu -incoming on the destination with the qed file
>> 4) execute the migration
>> 5) when migration completes, begin streaming on the destination to
>> complete the copy
>> 6) when the streaming is complete, shut down the qemu-nbd instance on
>> the source
>>      
> Hm, that's an interesting idea. :-)
>
> Kevin
>    
