From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from [140.186.70.92] (port=54741 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1PFmsJ-0006Qu-B2
	for qemu-devel@nongnu.org; Tue, 09 Nov 2010 07:00:33 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from ) id 1PFmsG-0003y7-As
	for qemu-devel@nongnu.org; Tue, 09 Nov 2010 07:00:29 -0500
Received: from mx1.redhat.com ([209.132.183.28]:46305)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from ) id 1PFmsG-0003y0-0U
	for qemu-devel@nongnu.org; Tue, 09 Nov 2010 07:00:28 -0500
Date: Tue, 9 Nov 2010 14:00:20 +0200
From: "Michael S. Tsirkin"
Message-ID: <20101109120020.GC22705@redhat.com>
References: <20101006204546.32127.70109.stgit@s20.home>
	<20101108114043.GB1075@redhat.com>
	<1289228397.19902.18.camel@x201>
	<20101108165406.GE7962@redhat.com>
	<1289236846.28165.24.camel@x201>
	<20101108205901.GB10777@redhat.com>
	<1289251417.28165.37.camel@x201>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1289251417.28165.37.camel@x201>
Subject: [Qemu-devel] Re: [PATCH 0/6] Save state error handling (kill off no_migrate)
List-Id: qemu-devel.nongnu.org
To: Alex Williamson
Cc: cam@cs.ualberta.ca, qemu-devel@nongnu.org, kvm@vger.kernel.org,
	quintela@redhat.com

On Mon, Nov 08, 2010 at 02:23:37PM -0700, Alex Williamson wrote:
> On Mon, 2010-11-08 at 22:59 +0200, Michael S. Tsirkin wrote:
> > On Mon, Nov 08, 2010 at 10:20:46AM -0700, Alex Williamson wrote:
> > > On Mon, 2010-11-08 at 18:54 +0200, Michael S. Tsirkin wrote:
> > > > On Mon, Nov 08, 2010 at 07:59:57AM -0700, Alex Williamson wrote:
> > > > > On Mon, 2010-11-08 at 13:40 +0200, Michael S. Tsirkin wrote:
> > > > > > On Wed, Oct 06, 2010 at 02:58:57PM -0600, Alex Williamson wrote:
> > > > > > > Our code paths for saving or migrating a VM are full of functions that
> > > > > > > return void, leaving no opportunity for a device to cancel a migration,
> > > > > > > either from error or incompatibility.  The ivshmem driver attempted to
> > > > > > > solve this with a no_migrate flag on the save state entry.  I think the
> > > > > > > more generic and flexible way to solve this is to allow driver save
> > > > > > > functions to fail.  This series implements that and converts ivshmem
> > > > > > > to use a set_params function to NAK migration much earlier in the
> > > > > > > process.  This touches a lot of files, but the bulk of those changes are
> > > > > > > simply s/void/int/ and tacking a "return 0" onto the end of functions.
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Alex
> > > > > >
> > > > > > Well error handling is always tricky: it seems easier to
> > > > > > require save handlers to never fail.
> > > > >
> > > > > Sure it's easier, but does that make it robust?
> > > >
> > > > More robust in the face of what kind of failure?
> > >
> > > I really don't understand why we're having a discussion about whether
> > > providing a means to return an error is a good thing or not.  These
> > > patches touch a lot of files, but the change is dead simple.
> >
> > I just don't see the motivation. Presumably your patches are
> > there to achieve some kind of goal, right? I am trying to
> > figure out what that goal is.
>
> My goal is that I want to be able to NAK a migration when devices are
> assigned, and I think we can do it more generically than the no_migrate
> flag so that it supports this application and any other reason that
> saves might fail in the future.

More generically but harder to understand and debug, IMO.

> > Currently savevm callbacks never fail. So they
> > return void. Why is returning 0 and adding a bunch of code to test the
> > condition that never happens a good idea?  It just seems to create more
> > ways for devices to shoot themselves in the foot.
>
> And more ways to indicate something bad happened and keep running.  We
> already have far too many abort() calls in the code.

If you can keep running why can't you migrate?

> > > > > > So there's a bunch of code here but what exactly is the benefit?
> > > > > > Since save handlers have no idea what the remote does,
> > > > > > what is the compatibility you mention?
> > > > >
> > > > > There are two users I currently have in mind.  ivshmem currently makes
> > > > > use of the register_device_unmigratable() because it makes use of host
> > > > > specific resources and connections (aiui).  This sets the no_migrate
> > > > > flag, which is not dynamic and a bit of a band-aid.
> > > > > The other is
> > > > > device assignment, which needs a way to NAK a migration since physical
> > > > > devices are never migratable.
> > > >
> > > > Well since all these can't be migrated ever, a fixed property actually seems
> > > > a good match. Sure it's not dynamic but all the easier to debug.
> > > >
> > > > > I imagine we could at some point have
> > > > > devices with state tied to other features that can't always be detached
> > > > > from the host, this tries to provide the infrastructure for that to
> > > > > happen.
> > > > >
> > > > > Alex
> > > >
> > > > Let guest control whether you can migrate?
> > > > Sounds like something that is more likely to be abused
> > > > than used constructively.
> > >
> > > s/guest/device/  So you would rather the migration failed on the
> > > incoming side where it may not be detected
> >
> > And incoming migration handlers *must* validate the input, anyway.
> > We should not plaster over this with checks on outgoing side.
>
> I'm not in any way suggesting incoming shouldn't do validation.

So that's enough to detect the problem.

> > > or it may be detected too
> > > late to stop the migration?
> > >
> > > Alex
> >
> > So there's a bug and device is in an unexpected state.
> > What can we do? Assert, print an error, notify guest - all these
> > come to mind. But stop migration? Seems arbitrary.
>
> Perhaps the problem is that either an assert or an fprintf are the first
> things that come to mind.  We shouldn't have guests randomly blowing up
> or telling users to go scan through their log files to find errors.
> It's not very hard to allow simple error handling, so why shouldn't our
> first plan of attack be to return an error so that the human/qmp monitor
> can detect it and inform the user.  For the current candidates for this
> interface, there's no point notifying the guest, it's the interface
> attempting to do the migration that needs to know there's something
> blocking it.
>
> Alex

I still don't understand, I am sorry. When will migration fail?
Assigned devices always fail migration so it's not a good example.

-- 
MST