Re: [Qemu-devel] RFCv2: virDomainSnapshotCreateXML enhancements

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* Re: [Qemu-devel] RFCv2: virDomainSnapshotCreateXML enhancements
       [not found]   ` <4E43D7B7.8050604@redhat.com>
@ 2011-08-11 14:11     ` Kevin Wolf
  2011-08-11 17:20       ` Eric Blake
  0 siblings, 1 reply; 3+ messages in thread
From: Kevin Wolf @ 2011-08-11 14:11 UTC (permalink / raw)
  To: Eric Blake; +Cc: libvir-list@redhat.com, Qemu-devel

[ CCed qemu-devel, just in case someone's interested ]

Am 11.08.2011 15:23, schrieb Eric Blake:
>> [ Okay, some of it is handled later in this document, but I think it's
>> still useful to leave this summary in my mail. External VM state is
>> something that you don't seem to have covered yet - can't we do this
>> already with live migration to a file? ]
> 
> Yes, external VM state is already covered with live migration to file, 
> and I will not be touching it while implementing this RFC, but future 
> extensions may be able to further unify the two concepts.

Thanks for your explanation regarding the multiple dimensions of
snapshot options, it all makes sense.

I also agree with your incremental approach. I just wanted to make sure
that we keep the possible extension in mind so that we won't end up in a
design that makes assumptions that don't hold true in the long run.

And in the long run I think that looking into unifying these features is
something that should be done, because they are really similar.

>>> libvirt can be taught to honor persistent=no for qemu by creating a
>>> qcow2 wrapper file prior to starting qemu, then tearing down that
>>> wrapper after the fact, although I'll probably leave that for later in
>>> my patch series.
>>
>> qemu can already do this with -drive snapshot=on. It must be allowed to
>> create a temporary file for this to work, of course. Is that a problem?
>> If not, I would just forward the option to qemu.
> 
> Where would that file be created?  If the main image is in a directory, 
> is the temporary would also live in that directory (shared storage 
> visible to another qemu for migration purposes) or in local storage 
> (preventing migration)?  

It uses whatever mkstemp() returns, i.e. usually something in /tmp.

> If migration is possible, would libvirt need to 
> be able to learn the name of the temporary file so as to tell the new 
> qemu on the destination the same temporary file name it should open?

That's a good point that I haven't thought of. Temporary disks isn't
something that immediately reminds me of VMs using live migration, but
there's really no reason against it. So maybe duplicating this in
libvirt could make some sense indeed.

> What about if the main image is a block device - there, the temporary 
> file obviously has to live somewhere else, but how does qemu decide 
> where, and should that decision be configurable by the user?  How will 
> things interact with SELinux labeling?  What about down the road when we 
> add enhancements to enforce that qemu cannot open() files on NFS, but 
> must instead receive fds by inheritance?

Yeah, that was basically my question, if letting qemu create a file in
/tmp would be a problem from a libvirt/SELinux perspective. Of course,
you're much more flexible if libvirt does it manually and allows to
specify where you want to create the temporary image etc.

> This certainly sounds like some fertile ground for design decisions on 
> how libvirt and qemu should interact; I don't know if -drive snapshot=on 
> is reliable enough for use by libvirt, or whether libvirt will end up 
> having to manage things itself.
> 
> Obviously, my implementation of this RFC will start simple, by rejecting 
> persistent=no for qemu, until we've answered some of those other design 
> questions; I can get snapshot_blkdev support working before we have to 
> tackle this enhancement.
> 
>>> The other thing to be aware of is that with internal snapshots, qcow2
>>> maintains a distinction between current state and a snapshot - that is,
>>> qcow2 is _always_ tracking a delta, and never modifies a named snapshot,
>>> even when you use 'qemu-img snapshot -a' to revert to different snapshot
>>> names.  But with named files, the original file now becomes a read-only
>>> backing file to a new active file; if we revert to the original file,
>>> and make any modifications to it, the active file that was using it as
>>> backing will be corrupted.  Therefore, the safest thing is to reject any
>>> attempt to revert to any snapshot (whether checkpoint or disk snapshot)
>>> that has an existing child snapshot consisting of an external disk
>>> snapshot.  The metadata for each of these children can be deleted
>>> manually, but that requires quite a few API calls (learn how many
>>> children exist, get the list of children, and for each child, get its
>>> xml to see if that child has the target snapshot as a parent, and if so
>>> delete the snapshot).  So as shorthand, virDomainRevertToSnapshot will
>>> be taught a new flag, VIR_DOMAIN_SNAPSHOT_REVERT_DELETE_CHILDREN, which
>>> first deletes any children of the snapshot about to be deleted prior to
>>> reverting to that particular child.
>>
>> I think the API should make it possible to revert to a given external
>> snapshot without deleting all children, but by creating another qcow2
>> file that uses the same backing file. Basically this new qcow2 file
>> would be the equivalent to the "current state" concept qcow2 uses for
>> internal snapshots.
> 
> Interesting idea.  But I'm not quite sure how to fit it into existing API.
> 
> Remember, existing API is that you have an existing file name, and when 
> you call snapshot_blkdev, you are specifying a new file name that 
> becomes the live qcow2 file, rendering the previous file name as the 
> snapshot.  So:
> 
> <disk name='vda' snapshot='external'>
>    <source file='/path/to/new'/>
> </disk>
> 
> in the <domainsnapshot> is naming the new active file name, not the 
> snapshot.  If we go with that representation, then reverting to the 
> snapshot means that you want to re-create a new qcow2 file for new 
> active state, but what do we call it?  We can't call it 
> /path/to/original (since we want to reuse that as the backing file to 
> both branches in the snapshot hierarchy), and we can't call it 
> /path/to/new from the xml naming unless we get rid of the existing copy 
> of /path/to/new.  I see two options for future enhancements, but neither 
> has to be implemented right away (that is, this RFC is fine limiting the 
> reversion to a disk-snapshot to only occur when there are no 
> descendants, as long as we can later relax that restriction in the 
> future once we figure out how to do branched descendants).

Meh. I understand the problem you're describing, it just sounds so
banal. :-)

If we're taking the analogy with internal snapshots, then this "current
state" doesn't really have a name. It only gets one when you create a
new snapshot and then a new "current snapshot" is created on top.

Just that renaming external files while they are in use is probably only
a great idea if you intend to confuse everyone...

>> It should be possible to make both look the same to users if we think
>> this is a good idea.
> 
> 1. As a user, I'd much rather have an interface where _I_ decide the 
> name of the snapshot, but keep the active file name unchanged.  That is, 
> the current semantics of snapshot_blkdev feel a bit backward (it 
> requires me to tell you the name of the new active file, and the 
> existing name becomes the snapshot), where I would naively expect to 
> have a mode where I tell you the name to rename() the existing file 
> into, at which point you then recreate the original name as a active 
> qcow2 file that has the new snapshot name as its backing file.  But I'm 
> not entirely sure how that would play with SELinux permissions.  Also, 
> while rename() works for files, it is lousy for the case of the original 
> name being a block device which can't be rename()'d, so I think the 
> current snapshot_blkdev semantics are correct even if they feel a bit 
> backwards.  But it would be nice if we could design future qemu 
> enhancements that would allow the creation of a snapshot of arbitrary 
> name while keeping the live file name unchanged.

I agree with you. It feels a bit backwards for snapshots, but it's
really the only reasonable thing to do if you're using external
snapshots. That you can't rename block devices is actually a very point
point, too.

There's one more point to consider: If creating a snapshot of foo.img
just creates a new bar.img, but I keep working on foo.img, I might
expect that by deleting bar.img I remove the snapshot, but foo.img keeps
working.

So working with renames might turn out to be tricky in many ways, and
not only technical ones.

> 2. It is possible to add a new libvirt API, virDomainSnapshotCreateFrom, 
> which takes an existing snapshot as a child of the given snapshot passed 
> in as its parent.  This would combine the action of reverting to a 
> disk-snapshot along with the xml argument necessary for naming a new 
> live file, so that you could indeed support branching off the 
> disk-snapshot with a user-specified or libvirt-generated new active file 
> name without having to delete the existing children that were branched 
> off the old active file name, and making the original base file the 
> backing file to both branches.  Unfortunately, adding a new API is out 
> of the question for backporting purposes.

This API would be completely pointless with internal snapshots, right?

The ideal result would be an API where the user doesn't really have to
deal with internal vs. external snapshots other than setting the right
flag/XML option/whatever and libvirt would do the mapping to the
low-level functions.

Of course, if we want to avoid renames (for which there are good
reasons), then maybe we can't really get a unified API for internal and
external snapshots. In this case, maybe using completely different
functions to signal that we have different semantics might be appropriate.

This looks like it still needs a lot of thought.

> 2a. But thinking about it a bit more, maybe we don't need a new API, but 
> just an XML enhancement to the existing virDomainSnapshotCreateXML! 
> That is, if I specify:
> <domainsnapshot>
>    <name>branched</name>
>    <parent>
>      <name>disk-snapstho</name>
>    </parent>
>    <disks>...</disks>
> </domainsnapshot>
> 
> then we can accomplish your goal, without any qemu changes, and without 
> any new libvirt API.  That is, right now, <parent> is an output-only 
> aspect of snapshot xml, but by allowing it to be an input element 
> (probably requiring the use of a new flag, 
> VIR_DOMAIN_SNAPSHOT_CREATE_BRANCH), then it is possible to both revert 
> to the state of the old snapshot and specify the new file name to use to 
> collect the branched delta data from that point in time.  It also means 
> that creation of a branched snapshot would have to learn some of the 
> same flags as reverting to a snapshot (can you create the branch as well 
> as run a new qemu process?)  I'll play with the ideas, once I get the 
> groundwork of this RFC done first.
> 
> Thanks for forcing me to think about it!

Yes, this sounds like a nice solution for this case, and it looks
consistent with your existing proposal.

It still doesn't change anything for the fundamental problem that you
pointed me at, that internal snapshots give you different semantics than
external snapshots. So I think this is where we need some more discussion.

Kevin

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [Qemu-devel] RFCv2: virDomainSnapshotCreateXML enhancements
  2011-08-11 14:11     ` [Qemu-devel] RFCv2: virDomainSnapshotCreateXML enhancements Kevin Wolf
@ 2011-08-11 17:20       ` Eric Blake
  0 siblings, 0 replies; 3+ messages in thread
From: Eric Blake @ 2011-08-11 17:20 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: libvir-list@redhat.com, Qemu-devel

On 08/11/2011 08:11 AM, Kevin Wolf wrote:
> I agree with you. It feels a bit backwards for snapshots, but it's
> really the only reasonable thing to do if you're using external
> snapshots. That you can't rename block devices is actually a very point
> point, too.
>
> There's one more point to consider: If creating a snapshot of foo.img
> just creates a new bar.img, but I keep working on foo.img, I might
> expect that by deleting bar.img I remove the snapshot, but foo.img keeps
> working.

More ideas on this front:

One of the ideas of 'live snapshot' is to grab state that I can copy to 
an independent backup, taking as much time as needed, with minimal 
interruption to qemu.  Given an original 'file' of any format, then we 
can consider the sequence:

rename file to file.tmp (assuming we figure out how to teach qemu about 
renames)
use snapshot_blkdev to recreate file with file.tmp as backup
in parallel:
   copy file.tmp to file.snapshot
   block pull the contents of file.tmp back into file
when both tasks have completed, remove file.tmp

Now, I have created a snapshot file.snap, which can safely be deleted 
without breaking 'file', and with minimal downtime to the qemu process. 
  It's just that there is a window of time where the the snapshot is 
still in progress (that is, until both file.snap and the block pull have 
completed); dealing with the wrinkle that this forces 'file' to now be 
qcow2, even if it started out raw; and dealing with rename() issues not 
being usable on block devices.  And a non-zero window of time between 
starting the sequence and reaching a stable completion implies 
ramifications to whether other commands would be locked out in the 
meantime, or whether it can be broken into multiple steps with progress 
checks along the way, whether events need to be exposed to track when 
pieces complete, and so on.

Another idea is that if qemu would ever gain a way to export the 
contents of an internal snapshot or backing file (aka external 
snapshot), independently of how that state differs from the current 
state, then another operation would be:

with qcow2 file, create an internal snapshot
use new API to copy out the snapshot state into file.snap, while qemu is 
still actively modifying current state
remove the internal snapshot

with the net result that appears the same as creating file.snap as an 
external snapshot of a given state in time, but where the original qcow2 
file is not impacted if file.snap is deleted.

>
> So working with renames might turn out to be tricky in many ways, and
> not only technical ones.

Hopefully we're leaving enough flexibility to support these additional 
snapshot modes, even if we don't implement everything in the first round.

>
>> 2. It is possible to add a new libvirt API, virDomainSnapshotCreateFrom,
>> which takes an existing snapshot as a child of the given snapshot passed
>> in as its parent.  This would combine the action of reverting to a
>> disk-snapshot along with the xml argument necessary for naming a new
>> live file, so that you could indeed support branching off the
>> disk-snapshot with a user-specified or libvirt-generated new active file
>> name without having to delete the existing children that were branched
>> off the old active file name, and making the original base file the
>> backing file to both branches.  Unfortunately, adding a new API is out
>> of the question for backporting purposes.
>
> This API would be completely pointless with internal snapshots, right?

On the contrary, it might be useful as a way to convert an internal 
snapshot into an external one.  But yes, we can already do branching 
children off internal snapshots without needing this new feature, so the 
new feature's main point is for use in creating a branching child off an 
external disk snapshot.

> The ideal result would be an API where the user doesn't really have to
> deal with internal vs. external snapshots other than setting the right
> flag/XML option/whatever and libvirt would do the mapping to the
> low-level functions.
>
> Of course, if we want to avoid renames (for which there are good
> reasons), then maybe we can't really get a unified API for internal and
> external snapshots. In this case, maybe using completely different
> functions to signal that we have different semantics might be appropriate.
>
> This looks like it still needs a lot of thought.

Different functions at the qemu level, at the libvirt level, or both?  I 
agree that the ideal libvirt semantics is a single interface with enough 
expressivity to properly map to all the underlying qemu options, where 
libvirt correctly decides between migrate to disk and qemu-img, savevm, 
snapshot_blkdev, block pull, or any other underlying operations, while 
still properly rejecting any combinations that are possible in the XML 
matrix but unsupported by current qemu capabilities.

>
>> 2a. But thinking about it a bit more, maybe we don't need a new API, but
>> just an XML enhancement to the existing virDomainSnapshotCreateXML!
>> That is, if I specify:
>> <domainsnapshot>
>>     <name>branched</name>
>>     <parent>
>>       <name>disk-snapstho</name>
>>     </parent>
>>     <disks>...</disks>
>> </domainsnapshot>
>>
>> then we can accomplish your goal, without any qemu changes, and without
>> any new libvirt API.  That is, right now,<parent>  is an output-only
>> aspect of snapshot xml, but by allowing it to be an input element
>> (probably requiring the use of a new flag,
>> VIR_DOMAIN_SNAPSHOT_CREATE_BRANCH), then it is possible to both revert
>> to the state of the old snapshot and specify the new file name to use to
>> collect the branched delta data from that point in time.  It also means
>> that creation of a branched snapshot would have to learn some of the
>> same flags as reverting to a snapshot (can you create the branch as well
>> as run a new qemu process?)  I'll play with the ideas, once I get the
>> groundwork of this RFC done first.
>>
>> Thanks for forcing me to think about it!
>
> Yes, this sounds like a nice solution for this case, and it looks
> consistent with your existing proposal.
>
> It still doesn't change anything for the fundamental problem that you
> pointed me at, that internal snapshots give you different semantics than
> external snapshots. So I think this is where we need some more discussion.

I guess at this point, my biggest concern is whether my RFC locks out 
any useful extensions, or if it still looks like we have enough 
flexibility by adding new XML constructs to cover new cases later on, 
while we wait for resolution of additional discussion on these sorts of 
internal vs. external issues.

-- 
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [Qemu-devel] RFCv2: virDomainSnapshotCreateXML enhancements
       [not found]   ` <4E53ADB5.2000804@redhat.com>
@ 2011-08-23 15:04     ` Stefan Hajnoczi
  0 siblings, 0 replies; 3+ messages in thread
From: Stefan Hajnoczi @ 2011-08-23 15:04 UTC (permalink / raw)
  To: Eric Blake
  Cc: Kevin Wolf, Robert Wang, libvir-list@redhat.com, Jes Sorensen,
	Jagane Sundar, qemu-devel

On Tue, Aug 23, 2011 at 2:40 PM, Eric Blake <eblake@redhat.com> wrote:
> On 08/23/2011 03:52 AM, Stefan Hajnoczi wrote:
>>
>> On Wed, Aug 10, 2011 at 11:08 PM, Eric Blake<eblake@redhat.com>  wrote:
>>>
>>> disk snapshot: the state of a virtual disk used at a given time; once a
>>> snapshot exists, then it is possible to track a delta of changes that
>>> have
>>> happened since that time.
>>
>> Did you go into details of the delta API anywhere?  I don't see it.
>
> It's not available yet, because qemu doesn't provide anything yet.  I think
> that APIs to inspect the actual delta disk contents between the current
> state and a prior snapshot will be similar to block pull, but we can't
> implement anything without support from the underlying tools.

Excellent, this is an opportunity where we need to think things
through on the QEMU side and come up with a proposal that you can give
feedback on.

There is no active work in implementing a dirty block tracking API in QEMU.

We already have the bs->dirty_bitmap for block-migration.c.  Jagane
has also implemented a dirty block feature for his LiveBackup API:
https://github.com/jagane/qemu-livebackup/blob/master/livebackup.h#L95

We also have the actual bdrv_is_allocated() information to determine
whether a qcow2/qed/etc image file has the sectors allocated or not.

As a starting point we could provide a way to enable bs->dirty_bitmap
for a block device and query its status.  This is not persistent (the
bitmap is in RAM) so the question becomes whether or not to persist?
And if we persist do we want to take the cheap route of syncing the
bitmap to disk only when cleanly terminating QEMU or to do a
crash-safe bitmap?

If we specify that the dirty bitmap is not guaranteed to persist
(because it is simply an advisory feature for incremental backup and
similar applications), then we can start simple and consider doing a
persistent implementation later.

Stefan

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2011-08-23 15:04 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <4E430148.3070506@redhat.com>
     [not found] ` <4E43A84E.4080603@redhat.com>
     [not found]   ` <4E43D7B7.8050604@redhat.com>
2011-08-11 14:11     ` [Qemu-devel] RFCv2: virDomainSnapshotCreateXML enhancements Kevin Wolf
2011-08-11 17:20       ` Eric Blake
     [not found] ` <CAJSP0QW3BR2DhYEWqThPD97UxWwoLBSZK-_2-enC=EqApSfPnA@mail.gmail.com>
     [not found]   ` <4E53ADB5.2000804@redhat.com>
2011-08-23 15:04     ` Stefan Hajnoczi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).