From mboxrd@z Thu Jan 1 00:00:00 1970
From: Juan Quintela
In-Reply-To: <20100421115410.5226f1dc@redhat.com> (Luiz Capitulino's message of "Wed, 21 Apr 2010 11:54:10 -0300")
References: <1271797792-24571-1-git-send-email-lcapitulino@redhat.com>
 <1271797792-24571-5-git-send-email-lcapitulino@redhat.com>
 <20100420185959.31829121@redhat.com>
 <20100421115410.5226f1dc@redhat.com>
Date: Wed, 21 Apr 2010 17:39:00 +0200
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Subject: [Qemu-devel] Re: [PATCH 04/22] savevm: do_loadvm(): Always resume the VM
List-Id: qemu-devel.nongnu.org
To: Luiz Capitulino
Cc: kwolf@redhat.com, qemu-devel@nongnu.org, armbru@redhat.com

Luiz Capitulino wrote:
> On Wed, 21 Apr 2010 10:36:29 +0200
> Juan Quintela wrote:
>
>> QTAILQ_FOREACH(dinfo, &drives, next) {
>>     bs1 = dinfo->bdrv;
>>     if (bdrv_has_snapshot(bs1)) {
>>
>>         /// We found a device that has snapshots
>>         ret = bdrv_snapshot_goto(bs1, name);
>>
>>         if (ret < 0) {
>>             /// And don't have a snapshot with the name that we wanted
>>             switch(ret) {
>>             case -ENOTSUP:
>>                 error_report("%sSnapshots not supported on device '%s'",
>>                              bs != bs1 ? "Warning: " : "",
>>                              bdrv_get_device_name(bs1));
>>                 break;
>>             case -ENOENT:
>>                 error_report("%sCould not find snapshot '%s' on device '%s'",
>>                              bs != bs1 ? "Warning: " : "",
>>                              name, bdrv_get_device_name(bs1));
>>                 break;
>>             default:
>>                 error_report("%sError %d while activating snapshot on '%s'",
>>                              bs != bs1 ? "Warning: " : "",
>>                              ret, bdrv_get_device_name(bs1));
>>                 break;
>>             }
>>             /* fatal on snapshot block device */
>> // I think that one inconditional exit with predjuice could be in order here
>>
>> // Notice that bdrv_snapshot_goto() modifies the disk, name is as bad as
>> // you can get. It just open the disk, opens the snapshot, increases
>> // its counter of users, and makes it available for use after here
>> // (i.e. loading state, posibly conflicting with previous running
>> // VM a.k.a. disk corruption.
>>
>>             if (bs == bs1)
>>                 return 0;
>>
>> // This error is as bad as it can gets :( We have to load a vmstate,
>> // and the disk that should have the memory image don't have it.
>> // This is an error, I just put the wrong nunmber the previous time.
>> // Notice that this error should be very rare.
>
> So, the current code is buggy and if you fix it (by returning -1)
> you'll get another bug: loadvm will stop the VM for trivial errors
> like a not found image.

It is not a trivial error!!!! And worse, it is not recoverable :(

> How do you plan to fix this?

Returning an error and stopping the machine.

>> As stated, I don't think that trying to run the machine at any point
>> would make any sense.
>> Only case where it is safe to run it is if the
>> failure is at get_bs_snapshots(), but at that point running the machine
>> means:
>
> Actually, it must not pause the VM when recovery is (clearly) possible,
> otherwise it's a usability bug for the user Monitor and a possibly serious
> bug when you don't have human intervention (eg. QMP).

It is not possible: we have changed the device status from what it was
before. All bets are off. We don't have a way to go back to the "safe
state".

>>
>> $ loadvm other_image
>> Error "other_image" snapshot don't exist.
>> $
>>
>> running the previous VM looks like something that should be done
>> explicitely. If the error happened after that get_bs_snapshots(),
>> We would need a new flag to just refuse to continue. Only valid
>> operations at that point are other loadvm operations, i.e. our state is
>> wrong one way or another.
>
> It's not clear to me how this flag can help, but anyway, what we need
> here is:
>
> 1. Fail when failure is reported (vs. report a failure and return OK)

This is a bug, plain and simple.

> 2. Don't keep the VM paused when recovery is possible
>
> If you can fix that, it's ok to me: I'll drop this and the next patch.
>
> Otherwise I'll have to insist on the split.

Re-read my email. At this point, nothing is fixable :( After doing the
1st:

>> ret = bdrv_snapshot_goto(bs1, name);

and not returning an error -> state has changed, period. You can't
restart the machine. If you prefer, you can change loadvm in a way that
after a failure -> you can't "cont" it until you get a "working" loadvm.

Later, Juan.
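
To make that last suggestion concrete, here is a minimal sketch (it is
not QEMU code) of a loadvm that separates the two failure cases discussed
in this thread. Every name in it is hypothetical: Disk, VM, needs_loadvm,
disk_snapshot_goto(), vm_loadvm() and vm_cont() are invented for
illustration, and disk_snapshot_goto() only stands in for
bdrv_snapshot_goto(). It assumes that a failure before any disk has been
switched is recoverable, and that any failure after the first successful
switch is not.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a block device (not QEMU's BlockDriverState). */
typedef struct Disk {
    const char *name;
    const char *snapshot;  /* the one snapshot this disk holds, or NULL */
    bool switched;         /* set once a snapshot goto has modified the disk */
} Disk;

/* Hypothetical VM run state. */
typedef struct VM {
    bool running;
    bool needs_loadvm;     /* set after a failed, non-recoverable loadvm */
} VM;

/* Stand-in for bdrv_snapshot_goto(): modifies the disk on success. */
static int disk_snapshot_goto(Disk *d, const char *name)
{
    if (d->snapshot == NULL || strcmp(d->snapshot, name) != 0) {
        return -ENOENT;
    }
    d->switched = true;
    return 0;
}

/* Always returns an error when one is reported. The VM is resumed only
 * if no disk had been modified yet; otherwise it stays paused and is
 * flagged so that "cont" is refused. */
static int vm_loadvm(VM *vm, Disk *disks, size_t n, const char *name)
{
    bool was_running = vm->running;
    bool modified = false;

    vm->running = false;   /* stop the VM while switching disks */

    for (size_t i = 0; i < n; i++) {
        int ret = disk_snapshot_goto(&disks[i], name);
        if (ret < 0) {
            if (modified) {
                vm->needs_loadvm = true;    /* state changed: no way back */
            } else {
                vm->running = was_running;  /* nothing touched: recoverable */
            }
            return ret;
        }
        modified = true;
    }

    vm->needs_loadvm = false;  /* a working loadvm clears the flag */
    vm->running = was_running;
    return 0;
}

/* "cont" is refused until a loadvm has succeeded. */
static int vm_cont(VM *vm)
{
    if (vm->needs_loadvm) {
        return -EINVAL;
    }
    vm->running = true;
    return 0;
}
```

With two disks where only the first holds "snap1", vm_loadvm() for "snap1"
switches the first disk, fails on the second, and leaves the VM paused
with needs_loadvm set; vm_cont() then returns -EINVAL until a later
vm_loadvm() succeeds on every disk.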