From: Jeff Cody <jcody@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@gmail.com>,
Zhi Hui Li <zhihuili@linux.vnet.ibm.com>,
Taisuke Yamada <tai@rakugaki.org>,
qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] CoW image commit+shrink(= make_empty) support
Date: Sat, 09 Jun 2012 12:52:12 -0400 [thread overview]
Message-ID: <4FD37F3C.9020309@redhat.com> (raw)
In-Reply-To: <4FD269D8.6000902@redhat.com>
On 06/08/2012 05:08 PM, Kevin Wolf wrote:
> Am 08.06.2012 20:33, schrieb Jeff Cody:
>> On 06/08/2012 01:57 PM, Kevin Wolf wrote:
>>> Am 08.06.2012 19:46, schrieb Jeff Cody:
>>>> On 06/08/2012 12:11 PM, Kevin Wolf wrote:
>>>>> Am 08.06.2012 16:32, schrieb Jeff Cody:
>>>>>> On 06/08/2012 09:53 AM, Stefan Hajnoczi wrote:
>>>>>>> On Fri, Jun 8, 2012 at 2:19 PM, Jeff Cody <jcody@redhat.com> wrote:
>>>>>>>> On 06/08/2012 08:42 AM, Stefan Hajnoczi wrote:
>>>>>>>>> Note that block-commit cannot work on the top-most image since the
>>>>>>>>> guest is still writing to that image and we might never be able to
>>>>>>>>> copy all the data into the base image (the guest could write new data
>>>>>>>>> as quickly as we copy it to the base). The command should check for
>>>>>>>>> this and reject the top-most image.
>>>>>>>>
>>>>>>>> By this you mean that you would like to disallow committing the
>>>>>>>> top-level image to the base? Perhaps there is a way to attempt to
>>>>>>>> converge, and adaptively give more time to the co-routine if we are able
>>>>>>>> to detect divergence. This may require violating the 'speed' parameter,
>>>>>>>> however, and make the commit less 'live'.
>>>>>>>
>>>>>>> Yes, I think we should disallow merging down the topmost image. The
>>>>>>> convergence problem is the same issue that live migration has. It's a
>>>>>>> hard problem and not something that is essential for block-commit. It
>>>>>>> would add complexity into the block-commit implementation - the only
>>>>>>> benefit would be that you can merge down the last COW snapshot. We
>>>>>>> already have an implementation that does this: the "commit" command
>>>>>>> which stops the guest :).
>>>>>>
>>>>>> OK. And, it is better to get this implemented now, and if desired, I
>>>>>> suppose we can always revisit the convergence at a later date (or not at
>>>>>> all, if there is no pressing case for it).
>>>>>
>>>>> Not allowing live commit for the topmost image may be a good way to
>>>>> break the problem down into more manageable pieces, but eventually the
>>>>> feature isn't complete until you can do it.
>>>>>
>>>>> Basically what you need there is the equivalent of an active mirror. I
>>>>> wouldn't be surprised if actually code could be reused for both.
>>>>
>>>> I agree, doing it like mirroring for new writes on the top layer makes
>>>> sense, as long as you are willing to violate the (optional) speed
>>>> parameter. I wouldn't think violating the speed parameter in that case
>>>> would be an issue, as long as it is a documented affect of performing a
>>>> live commit on the active (top) layer.
>>>
>>> The question is what the speed parameter really means for a live commit
>>> block job (or for an active mirror). I think it would make sense to
>>> limit only the speed of I/O that doesn't come from the guest but is from
>>> the background copying.
>>>
>>> If you did apply it to guest I/O as well, then I think the logical
>>> consequence would be that guest I/O is throttled as well because
>>> requests only complete when they have reached both source and target.
>>>
>>> (I know I'm happily mixing terminology of all kinds of live block jobs
>>> here, hope you understand anyway what I mean ;-))
>>
>> I understand what you mean :) My thought is that the block commit
>> 'speed' parameter would essentially only apply to 'old data', that is
>> being copied in the commit coroutine, and it would be ignored (violated)
>> for 'new data' in the mirroring-like case.
>>
>> I assumed we wouldn't want to throttle the guest I/O, but that may be a
>> bad assumption on my part.
>
> I don't disagree, I just wanted to point out that throttling the guest
> would be another option. Applying the limit only to old data probably
> makes most sense indeed.
>
Given that, perhaps it does make sense to get it implemented in two
phases - first the straightforward block commit for images below the
active layer, and then a follow-up patch set to implement committing in
the top layer alongside an active mirroring approach for convergence.
>>>>>> I am happy to do it that way. I'll shift my focus to the atomic image
>>>>>> reopen in r/w mode. I'll go ahead and post my diagrams and other info
>>>>>> for block-commit on the wiki, because I don't think it conflicts with we
>>>>>> discussed above (although I will modify my diagrams to not show commit
>>>>>> from the top-level image). Of course, feel free to change it as
>>>>>> necessary.
>>>>>
>>>>> I may have mentioned it before, but just in case: I think Supriya's
>>>>> bdrv_reopen() patches are a good starting point. I don't know why they
>>>>> were never completed, but I think we all agreed on the general design,
>>>>> so it should be possible to pick that up.
>>>>>
>>>>> Though if you have already started with your own work on it, Jeff, I
>>>>> expect that it won't be much different because it's basically the same
>>>>> transactional approach that you know and that we already used for group
>>>>> snapshots.
>>>>
>>>> I will definitely use parts of Supriya's as it makes sense - what I
>>>> started work on is similar to bdrv_append() and the current transaction
>>>> approach, so there will be plenty in common to reuse, even with some
>>>> differences.
>>>
>>> I'm not sure how far a bdrv_append-like approach will work in this case.
>>> The problem is that you must not open any image r/w twice. The policy is
>>> that you can open an image either once r/w or multiple times read-only.
>>
>> I think there are 3 possibilities that would be OK: open once r/w, one
>> or more times r/o, or once r/w + one or more times r/o. Do you agree
>> with this? Is the latter case against block-layer policy?
>
> Yes, it is. The problem is that image formats have data cached in
> memory, and as soon as you have a writer, the other users of the same
> image file - no matter whether r/w or r/o - get out of sync. They might
> mix old (cached) matadata with other new metadata, and I think you can
> imagine what mess would result.
>
> Strictly speaking even raw caches things like the image size, so if the
> r/w instance happens to resize the image, the additional r/o readers
> would have a problem even with raw.
>
OK, thanks. I should have qualified that with saying, for this
application, that it is not intended to have the images open r/w and r/o
for ongoing operations; it is just for the purpose of overlapping the
closing of the r/o instance.
>> In this special case for block commit, we need to do the reopen() to go
>> from a r/o image to a r/w image. If an image is opened r/o, the drv
>> layer can do another open with a new descriptor, this time r/w, without
>> closing the r/o instance. If the new r/w open was successful, then the
>> r/o instance can be closed, and some bdrv_append()-like operations done.
>> Otherwise, like the current transactional code, we can just 'walk away'
>> by only closing the new image (if it got that far), and the existing BDS
>> with the r/o image is safely unchanged (and never closed, so no
>> unrecoverable error paths).
>
> Yes. It's just that you need to be very careful. You would ask the qcow2
> driver to reopen the image, and it would take care to invalidate any
> cached data and then pass the request on to the next layer. When you
> have drilled down through the stack to raw-posix, it would actually open
> the second file descriptor and wait for a commit/abort that goes through
> the whole stack again.
>
That makes sense, and I agree that it is a bit tricky. The other
question is what protocol / drivers will support this sort of operation,
as well. And I certainly think that using Supriya's patches will prove
useful.
Thanks,
Jeff
next prev parent reply other threads:[~2012-06-09 16:52 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-06-07 6:19 [Qemu-devel] CoW image commit+shrink(= make_empty) support Taisuke Yamada
2012-06-07 14:14 ` Jeff Cody
2012-06-08 12:42 ` Stefan Hajnoczi
2012-06-08 13:19 ` Jeff Cody
2012-06-08 13:53 ` Stefan Hajnoczi
2012-06-08 14:32 ` Jeff Cody
2012-06-08 16:11 ` Kevin Wolf
2012-06-08 17:46 ` Jeff Cody
2012-06-08 17:57 ` Kevin Wolf
2012-06-08 18:33 ` Jeff Cody
2012-06-08 21:08 ` Kevin Wolf
2012-06-09 16:52 ` Jeff Cody [this message]
2012-06-11 7:57 ` Kevin Wolf
2012-06-10 16:10 ` Paolo Bonzini
2012-06-11 7:59 ` Kevin Wolf
2012-06-11 8:01 ` Paolo Bonzini
2012-06-11 12:09 ` Stefan Hajnoczi
2012-06-11 12:50 ` Kevin Wolf
2012-06-11 14:24 ` Stefan Hajnoczi
2012-06-11 15:37 ` Jeff Cody
2012-06-11 19:12 ` Paolo Bonzini
2012-06-12 7:27 ` Zhi Hui Li
2012-06-12 10:56 ` Stefan Hajnoczi
2012-06-13 10:56 ` Supriya Kannery
2012-06-14 14:23 ` Zhi Hui Li
2012-06-14 14:29 ` Jeff Cody
2012-06-14 18:28 ` Supriya Kannery
2012-06-15 21:01 ` Supriya Kannery
2012-06-10 16:06 ` Paolo Bonzini
2012-06-08 10:39 ` Kevin Wolf
2012-06-09 11:21 ` Taisuke Yamada
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4FD37F3C.9020309@redhat.com \
--to=jcody@redhat.com \
--cc=kwolf@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=stefanha@gmail.com \
--cc=tai@rakugaki.org \
--cc=zhihuili@linux.vnet.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).