Re: [Qemu-devel] CoW image commit+shrink(= make_empty) support

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Jeff Cody <jcody@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@gmail.com>,
	Zhi Hui Li <zhihuili@linux.vnet.ibm.com>,
	Taisuke Yamada <tai@rakugaki.org>,
	qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] CoW image commit+shrink(= make_empty) support
Date: Sat, 09 Jun 2012 12:52:12 -0400	[thread overview]
Message-ID: <4FD37F3C.9020309@redhat.com> (raw)
In-Reply-To: <4FD269D8.6000902@redhat.com>

On 06/08/2012 05:08 PM, Kevin Wolf wrote:
> Am 08.06.2012 20:33, schrieb Jeff Cody:
>> On 06/08/2012 01:57 PM, Kevin Wolf wrote:
>>> Am 08.06.2012 19:46, schrieb Jeff Cody:
>>>> On 06/08/2012 12:11 PM, Kevin Wolf wrote:
>>>>> Am 08.06.2012 16:32, schrieb Jeff Cody:
>>>>>> On 06/08/2012 09:53 AM, Stefan Hajnoczi wrote:
>>>>>>> On Fri, Jun 8, 2012 at 2:19 PM, Jeff Cody <jcody@redhat.com> wrote:
>>>>>>>> On 06/08/2012 08:42 AM, Stefan Hajnoczi wrote:
>>>>>>>>> Note that block-commit cannot work on the top-most image since the
>>>>>>>>> guest is still writing to that image and we might never be able to
>>>>>>>>> copy all the data into the base image (the guest could write new data
>>>>>>>>> as quickly as we copy it to the base).  The command should check for
>>>>>>>>> this and reject the top-most image.
>>>>>>>>
>>>>>>>> By this you mean that you would like to disallow committing the
>>>>>>>> top-level image to the base?  Perhaps there is a way to attempt to
>>>>>>>> converge, and adaptively give more time to the co-routine if we are able
>>>>>>>> to detect divergence.  This may require violating the 'speed' parameter,
>>>>>>>> however, and make the commit less 'live'.
>>>>>>>
>>>>>>> Yes, I think we should disallow merging down the topmost image.  The
>>>>>>> convergence problem is the same issue that live migration has.  It's a
>>>>>>> hard problem and not something that is essential for block-commit.  It
>>>>>>> would add complexity into the block-commit implementation - the only
>>>>>>> benefit would be that you can merge down the last COW snapshot.  We
>>>>>>> already have an implementation that does this: the "commit" command
>>>>>>> which stops the guest :).
>>>>>>
>>>>>> OK.  And, it is better to get this implemented now, and if desired, I
>>>>>> suppose we can always revisit the convergence at a later date (or not at
>>>>>> all, if there is no pressing case for it).
>>>>>
>>>>> Not allowing live commit for the topmost image may be a good way to
>>>>> break the problem down into more manageable pieces, but eventually the
>>>>> feature isn't complete until you can do it.
>>>>>
>>>>> Basically what you need there is the equivalent of an active mirror. I
>>>>> wouldn't be surprised if actually code could be reused for both.
>>>>
>>>> I agree, doing it like mirroring for new writes on the top layer makes
>>>> sense, as long as you are willing to violate the (optional) speed
>>>> parameter.  I wouldn't think violating the speed parameter in that case
>>>> would be an issue, as long as it is a documented affect of performing a
>>>> live commit on the active (top) layer.
>>>
>>> The question is what the speed parameter really means for a live commit
>>> block job (or for an active mirror). I think it would make sense to
>>> limit only the speed of I/O that doesn't come from the guest but is from
>>> the background copying.
>>>
>>> If you did apply it to guest I/O as well, then I think the logical
>>> consequence would be that guest I/O is throttled as well because
>>> requests only complete when they have reached both source and target.
>>>
>>> (I know I'm happily mixing terminology of all kinds of live block jobs
>>> here, hope you understand anyway what I mean ;-))
>>
>> I understand what you mean :) My thought is that the block commit
>> 'speed' parameter would essentially only apply to 'old data', that is
>> being copied in the commit coroutine, and it would be ignored (violated)
>> for 'new data' in the mirroring-like case.
>>
>> I assumed we wouldn't want to throttle the guest I/O, but that may be a
>> bad assumption on my part.
> 
> I don't disagree, I just wanted to point out that throttling the guest
> would be another option. Applying the limit only to old data probably
> makes most sense indeed.
> 

Given that, perhaps it does make sense to get it implemented in two
phases - first the straightforward block commit for images below the
active layer, and then a follow-up patch set to implement committing in
the top layer alongside an active mirroring approach for convergence.

>>>>>> I am happy to do it that way.  I'll shift my focus to the atomic image
>>>>>> reopen in r/w mode.  I'll go ahead and post my diagrams and other info
>>>>>> for block-commit on the wiki, because I don't think it conflicts with we
>>>>>> discussed above (although I will modify my diagrams to not show commit
>>>>>> from the top-level image).  Of course, feel free to change it as
>>>>>> necessary.
>>>>>
>>>>> I may have mentioned it before, but just in case: I think Supriya's
>>>>> bdrv_reopen() patches are a good starting point. I don't know why they
>>>>> were never completed, but I think we all agreed on the general design,
>>>>> so it should be possible to pick that up.
>>>>>
>>>>> Though if you have already started with your own work on it, Jeff, I
>>>>> expect that it won't be much different because it's basically the same
>>>>> transactional approach that you know and that we already used for group
>>>>> snapshots.
>>>>
>>>> I will definitely use parts of Supriya's as it makes sense - what I
>>>> started work on is similar to bdrv_append() and the current transaction
>>>> approach, so there will be plenty in common to reuse, even with some
>>>> differences.
>>>
>>> I'm not sure how far a bdrv_append-like approach will work in this case.
>>> The problem is that you must not open any image r/w twice. The policy is
>>> that you can open an image either once r/w or multiple times read-only.
>>
>> I think there are 3 possibilities that would be OK: open once r/w, one
>> or more times r/o, or once r/w + one or more times r/o.  Do you agree
>> with this?  Is the latter case against block-layer policy?
> 
> Yes, it is. The problem is that image formats have data cached in
> memory, and as soon as you have a writer, the other users of the same
> image file - no matter whether r/w or r/o - get out of sync. They might
> mix old (cached) matadata with other new metadata, and I think you can
> imagine what mess would result.
> 
> Strictly speaking even raw caches things like the image size, so if the
> r/w instance happens to resize the image, the additional r/o readers
> would have a problem even with raw.
> 

OK, thanks. I should have qualified that with saying, for this
application, that it is not intended to have the images open r/w and r/o
for ongoing operations; it is just for the purpose of overlapping the
closing of the r/o instance.

>> In this special case for block commit, we need to do the reopen() to go
>> from a r/o image to a r/w image.  If an image is opened r/o, the drv
>> layer can do another open with a new descriptor, this time r/w, without
>> closing the r/o instance.  If the new r/w open was successful, then the
>> r/o instance can be closed, and some bdrv_append()-like operations done.
>> Otherwise, like the current transactional code, we can just 'walk away'
>> by only closing the new image (if it got that far), and the existing BDS
>> with the r/o image is safely unchanged (and never closed, so no
>> unrecoverable error paths).
> 
> Yes. It's just that you need to be very careful. You would ask the qcow2
> driver to reopen the image, and it would take care to invalidate any
> cached data and then pass the request on to the next layer. When you
> have drilled down through the stack to raw-posix, it would actually open
> the second file descriptor and wait for a commit/abort that goes through
> the whole stack again.
>

That makes sense, and I agree that it is a bit tricky.  The other
question is what protocol / drivers will support this sort of operation,
as well. And I certainly think that using Supriya's patches will prove
useful.

Thanks,
Jeff

next prev parent reply	other threads:[~2012-06-09 16:52 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-06-07  6:19 [Qemu-devel] CoW image commit+shrink(= make_empty) support Taisuke Yamada
2012-06-07 14:14 ` Jeff Cody
2012-06-08 12:42   ` Stefan Hajnoczi
2012-06-08 13:19     ` Jeff Cody
2012-06-08 13:53       ` Stefan Hajnoczi
2012-06-08 14:32         ` Jeff Cody
2012-06-08 16:11           ` Kevin Wolf
2012-06-08 17:46             ` Jeff Cody
2012-06-08 17:57               ` Kevin Wolf
2012-06-08 18:33                 ` Jeff Cody
2012-06-08 21:08                   ` Kevin Wolf
2012-06-09 16:52                     ` Jeff Cody [this message]
2012-06-11  7:57                       ` Kevin Wolf
2012-06-10 16:10                 ` Paolo Bonzini
2012-06-11  7:59                   ` Kevin Wolf
2012-06-11  8:01                     ` Paolo Bonzini
2012-06-11 12:09               ` Stefan Hajnoczi
2012-06-11 12:50                 ` Kevin Wolf
2012-06-11 14:24                   ` Stefan Hajnoczi
2012-06-11 15:37                     ` Jeff Cody
2012-06-11 19:12                       ` Paolo Bonzini
2012-06-12  7:27                         ` Zhi Hui Li
2012-06-12 10:56                       ` Stefan Hajnoczi
2012-06-13 10:56                         ` Supriya Kannery
2012-06-14 14:23                       ` Zhi Hui Li
2012-06-14 14:29                         ` Jeff Cody
2012-06-14 18:28                           ` Supriya Kannery
2012-06-15 21:01                             ` Supriya Kannery
2012-06-10 16:06       ` Paolo Bonzini
2012-06-08 10:39 ` Kevin Wolf
2012-06-09 11:21   ` Taisuke Yamada

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4FD37F3C.9020309@redhat.com \
    --to=jcody@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@gmail.com \
    --cc=tai@rakugaki.org \
    --cc=zhihuili@linux.vnet.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).