From: Juan Quintela
Date: Fri, 27 Nov 2009 13:12:56 +0100
Subject: [Qemu-devel] Migration issues and possible solutions (Very long)
To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org

Introduction
------------

Following Dor Laor's mail thread:

  Live migration protocol, device features, ABIs and other beasts

several of us discussed the problem and possible solutions. This mail
is a summary of the thread and discussions. I am the one who summarized
the discussion, but there were lots of participants, and I tried to
attribute the good ideas to their authors.

BIAS
----

I like the idea of having several sections + whitelists, and selecting
the version of the device at start time. I tried not to be biased in
the rest of the document, but this way you have been warned of what my
bias is.
Problems with current migration
-------------------------------

Issue 1: Change of migration format inside a stable release
-------------------------------------------------------------

The qemu savevm format allows migrating from an old release to a new
release, i.e. you can migrate a machine from qemu-0.11 to qemu-0.12 if
the device versions are compatible. What is not supported in general
is migrating from qemu-0.12 to qemu-0.11.

But inside a stable release, we are supposed to be able to migrate back
and forth, i.e. qemu-0.11.0 <-> qemu-0.11.1.

So what happens if we are in a stable release and we find a bug in the
savevm format? A concrete example that happened to us is that the
values of two msr's were not saved in the "cpu" state.

What to do here? We can:
- get a new savevm format, and break the assumption that inside a
  stable branch you can migrate back and forth.
- declare that the savevm format for a stable release is carved in
  stone and if it has a bug, bad luck.

Now think that you have a cluster of machines, and that upgrading all
of them at the same time is not an option. Notice that both solutions
are bad. And we don't think that this is the last time that we will
have a bug in the savevm format inside a stable release.

This is not an academic problem: we are having this problem in RHEL
with time drift and pvclock.

Issue 2: Reverting to older vmstate version for compatibility
-------------------------------------------------------------

In the previous case, we would like qemu-0.11.1 to be able to migrate
to qemu-0.11.0 (i.e. have some way to disable the saving of the new
fields, the msr's).

The problem here is that qemu currently refuses to load newer savevm
sections because it doesn't know how to interpret them. And qemu can
only migrate a device using the latest version it knows about.
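To make the asymmetry of Issue 2 concrete, here is a toy Python model
of linear versioning. It is only a sketch and not qemu code:
DeviceFormat, save_version, can_load and can_migrate are hypothetical
names, and the version numbers 7 and 8 are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceFormat:
    """Hypothetical: what one qemu build knows about one device section."""
    name: str
    latest_version: int  # newest section version this build understands


def save_version(fmt: DeviceFormat) -> int:
    # The sender has no choice: it always emits its latest version.
    return fmt.latest_version


def can_load(fmt: DeviceFormat, incoming_version: int) -> bool:
    # The receiver refuses any section newer than what it knows.
    return incoming_version <= fmt.latest_version


def can_migrate(source: DeviceFormat, target: DeviceFormat) -> bool:
    return can_load(target, save_version(source))


# "cpu" section before and after the msr fix (made-up version numbers).
cpu_old = DeviceFormat("cpu", latest_version=7)
cpu_new = DeviceFormat("cpu", latest_version=8)

print(can_migrate(cpu_old, cpu_new))  # old -> new: True
print(can_migrate(cpu_new, cpu_old))  # new -> old: False
```

In this model the only way to get new -> old working is to let
save_version() return something other than the latest version, which is
exactly the missing knob.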
Issue 3: Selecting appropriate vmstate version for machine type
--------------------------------------------------------------

When launching qemu-0.12 -M pc-0.11, it is desirable to be able to live
migrate to a qemu 0.11.0 version. Due to Issue 2, this is not possible.
This is a bug that needs to be fixed in upstream qemu.

Issue 4: Limitations of linear versioning
-----------------------------------------

Linear versioning has a problem when you have to fix things in the
stable branch:

  device "foo" has v3 in 0.11
  device "foo" has v4 in 0.12

Now a problem is found in 0.11 that requires a change in the savevm
format. What version can we use for the new "foo" device format?
If we use v3, migration to old 0.11 will fail (old 0.11 expects the
old v3 layout).
If we use v4, migration to 0.12 will fail (0.12 already uses v4 for a
different layout).

This also happens all the time for kvm. kvm needs a new field for a
device, so it adds the field and increases the version number. But then
at some point qemu upstream adds another field and also increases the
version number. We have a conflict, and there is no way to express
this kind of relation with linear versioning.

This is not only a problem for downstreams (kvm, Red Hat, Novell, ...);
it also happens for qemu stable branches.

Proposals for #1, #2, #3
------------------------

Thinking about ways of trying to solve/mitigate this problem, there are
several suggestions (not sorted in any particular order):

- Dor's email solution: Control *every* feature exposed to the guest
  from the qemu command line. Obviously this is very flexible, but it
  has the cost of adding all the knobs and testing that they work.

- Anthony/Juan solution: (I don't remember who proposed it first, but
  we agreed on a lot of points.)
  We already have a mechanism that does part of this:

    qemu-rhel5.4.1 -M rhel5.4

  This should launch qemu of 5.4.1 with a machine type of rhel5.4.
  But (and this is a big but) current qemu with that command line
  launches a machine with the devices of rhel5.4, but it uses the
  savevm protocol of 5.4.1.
  Ok, then we define that this is a bug that we have to fix. Once
  this is fixed, you can fix issue 2.

- Variant of the previous solution (Michael Tsirkin): Do not merge
  machine description and savevm formats. Use a monitor command to
  specify which version we want the savevm format to be.
  This has one advantage: you can save to different versions at any
  point. The reasoning for this is that only guest-visible things
  should go in the machine description, and the savevm format is not
  guest visible.

  Note from Juan: using the machine type has the advantage that at
  device creation time we know what version we are emulating. We
  can change more things than the savevm format.

- Protocol negotiation (Dor, Gleb, mst and Eduardo at least defend
  part of this solution). The idea is to have source and target look
  for common versions that will work together.

  You can fix this problem in a completely different way. When you
  migrate you just negotiate between source and target what versions
  to use for each device. The possibility that I suggested (but
  there are more) was:

    source sends all devices with the version ranges that it supports
    target chooses the highest version number of each device that it
      supports, and answers that to source (if such a valid list
      exists).

  Gleb proposed that the source just send all possible formats, and
  the target select the one that it understands.

  Anthony doesn't like this one; he thinks that this should not be
  part of the savevm protocol and that this should be done higher in
  the management stack. Info devices should print savevm options and
  management should find the versions beforehand.

- Dor/mst proposal of optional features

  This came from previous discussions. Dor wants to put optional
  fields in the savevm protocol that the target can just discard. I am
  against this, because it makes the number of test cases exponential.

  mst always cited one case, which is when the driver knows that it
  hasn't used a feature. An example is msix.
  A device can know that the guest is not using msix on that device
  and just not send the msix part of the information. That way you can
  migrate back and forth between machines where the only difference
  is msix support in devices.

  Even though I don't like the optional features, I agree that this
  idea has some merit. I think that the majority of the cases are not
  independent features like this one, but for features like this one
  it makes sense.

To summarize, at this point basically all proposals agree that we want
a way to select an old version of the savevm format. But the _when_
and _where_ have not been agreed yet.

1) Regarding _when_ this setting may be defined:
   a) Defining the versions at startup
   b) Defining the versions at runtime
   c) some combination of the above
   d) some other option?

2) Regarding _where_ this is defined:
   a) machine-type
   b) qdev
   c) other config option created just for the savevm version
   d) monitor "set-machine-wide-savevm-version" command
   e) monitor "set-device-savevm-version" command
   f) some combination of the above
   g) some other option?

Issue 4: Limitations of linear versioning
-----------------------------------------

Possible solution: Hierarchical version numbers (Anthony's proposal)

It means changing the protocol to allow for two device versions, one
for qemu and another for downstream (kvm/xen/...).

This has some appeal, but also has its problems. What happens when a
distro packages kvm? They need yet another version number, i.e. they
made modifications to a device that was also modified in kvm.

And it also doesn't solve the problem with updates (see problem 1).
A device is at version 7 in 0.11.0, and at version 8 in qemu 0.12.0.
We find a bug in 0.11.0 and we need to change the wire format: what
version to use?

Another possible solution: Use feature (sub)sections (Avi's suggestion)

Each time that we add a new set of fields to a device state, we just
create a new subsection for these fields.
That makes it easier to cherry-pick features for backporting to stable
series, and makes it easy to create backward compatibility.

As we are very near 0.12 and we can't implement subsections properly,
Avi's suggestion is to add new sections with names like:

  "device/feature/vendor"

That is more descriptive than hierarchical versions, and is quite
feasible.

This has a problem though, and that is the exponential number of test
cases. I.e. if we have device A with features a and b, we have the
following possible combinations:

  A
  A+a
  A+b
  A+a+b

As you can guess, as the number of features grows, the number of test
cases grows very fast.

The solution for this problem is that not all combinations are valid.
Only the combinations listed in the whitelist by the driver as valid
will be accepted (for instance, A, A+a, A+a+b).

Once we have some kind of negotiation (be it at the savevm protocol
level or higher up in the management stack), and we have some way to
set the savevm version (again, any of the alternatives), we will have
more flexible migration between versions, and an easier way to
maintain the stable branches.

A related problem is savevm/loadvm. When we do a savevm, we don't know
what version will do the loadvm. And that means that we can't do a
negotiation.

One possible solution is to save all sections and the whitelist of
valid combinations at savevm time. Then at loadvm time, we just search
for a valid combination in the whitelist and use only those sections;
otherwise we fail the loadvm.

Sorry for the long mail, but this is describing a very long thread
with lots of problems/suggestions.

Later, Juan.