From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1O4c17-0003kX-QP
	for qemu-devel@nongnu.org; Wed, 21 Apr 2010 11:39:09 -0400
Received: from [140.186.70.92] (port=60136 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1O4c16-0003hB-6k
	for qemu-devel@nongnu.org; Wed, 21 Apr 2010 11:39:09 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
	(envelope-from <quintela@redhat.com>) id 1O4c12-0002mB-Fe
	for qemu-devel@nongnu.org; Wed, 21 Apr 2010 11:39:08 -0400
Received: from mx1.redhat.com ([209.132.183.28]:2047)
	by eggs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <quintela@redhat.com>) id 1O4c12-0002ld-8F
	for qemu-devel@nongnu.org; Wed, 21 Apr 2010 11:39:04 -0400
Received: from int-mx01.intmail.prod.int.phx2.redhat.com
	(int-mx01.intmail.prod.int.phx2.redhat.com [10.5.11.11])
	by mx1.redhat.com (8.13.8/8.13.8) with ESMTP id o3LFd24m028286
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK)
	for <qemu-devel@nongnu.org>; Wed, 21 Apr 2010 11:39:02 -0400
From: Juan Quintela <quintela@redhat.com>
In-Reply-To: <20100421115410.5226f1dc@redhat.com> (Luiz Capitulino's message
	of "Wed, 21 Apr 2010 11:54:10 -0300")
References: <1271797792-24571-1-git-send-email-lcapitulino@redhat.com>
	<1271797792-24571-5-git-send-email-lcapitulino@redhat.com>
	<m3mxwx3jbc.fsf@trasno.mitica> <20100420185959.31829121@redhat.com>
	<m3y6gh19te.fsf@trasno.mitica> <20100421115410.5226f1dc@redhat.com>
Date: Wed, 21 Apr 2010 17:39:00 +0200
Message-ID: <m3mxwwolwr.fsf@trasno.mitica>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Subject: [Qemu-devel] Re: [PATCH 04/22] savevm: do_loadvm(): Always resume
	the VM
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.gnu.org/pipermail/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <http://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Luiz Capitulino <lcapitulino@redhat.com>
Cc: kwolf@redhat.com, qemu-devel@nongnu.org, armbru@redhat.com

Luiz Capitulino <lcapitulino@redhat.com> wrote:
> On Wed, 21 Apr 2010 10:36:29 +0200
> Juan Quintela <quintela@redhat.com> wrote:
>
>>     QTAILQ_FOREACH(dinfo, &drives, next) {
>>         bs1 = dinfo->bdrv;
>>         if (bdrv_has_snapshot(bs1)) {
>> 
>> /// We found a device that has snapshots
>>             ret = bdrv_snapshot_goto(bs1, name);
>>             if (ret < 0) {
>> /// And don't have a snapshot with the name that we wanted
>>                 switch(ret) {
>>                 case -ENOTSUP:
>>                     error_report("%sSnapshots not supported on device '%s'",
>>                                  bs != bs1 ? "Warning: " : "",
>>                                  bdrv_get_device_name(bs1));
>>                     break;
>>                 case -ENOENT:
>>                     error_report("%sCould not find snapshot '%s' on device '%s'",
>>                                  bs != bs1 ? "Warning: " : "",
>>                                  name, bdrv_get_device_name(bs1));
>>                     break;
>>                 default:
>>                     error_report("%sError %d while activating snapshot on '%s'",
>>                                  bs != bs1 ? "Warning: " : "",
>>                                  ret, bdrv_get_device_name(bs1));
>>                     break;
>>                 }
>>                 /* fatal on snapshot block device */
>> // I think that an unconditional exit with prejudice could be in order here
>> 
>> // Notice that bdrv_snapshot_goto() modifies the disk; the name is as bad
>> // as it can get.  It just opens the disk, opens the snapshot, increases
>> // its counter of users, and makes it available for use after here
>> // (i.e. loading state, possibly conflicting with the previously running
>> // VM, a.k.a. disk corruption).
>> 
>>                 if (bs == bs1)
>>                     return 0;
>> 
>> // This error is as bad as it can get :(  We have to load a vmstate,
>> // and the disk that should have the memory image doesn't have it.
>> // This is an error, I just put the wrong number the previous time.
>> // Notice that this error should be very rare.
>
>  So, the current code is buggy and if you fix it (by returning -1)
> you'll get another bug: loadvm will stop the VM for trivial errors
> like a not found image.

It is not a trivial error!!!!  And worse, it is not recoverable :(

>  How do you plan to fix this?

Returning an error and stopping the machine.
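
To make that concrete, here is a rough sketch of what I mean.  It is not
the real savevm.c code: struct sketch_dev and both sketch_* functions are
made-up stand-ins, and goto_result just plays the role of whatever
bdrv_snapshot_goto() would return for that device.  The only point is that
a failure on the device holding the VM state goes back to the caller as an
error, and the caller leaves the machine stopped.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a block device; not the real BlockDriverState. */
struct sketch_dev {
    const char *name;
    int goto_result;   /* what a bdrv_snapshot_goto()-like call would return */
};

/*
 * Activate the snapshot on every device.  devs[0] plays the role of the
 * device that holds the VM state ("bs" in the quoted code).
 * Returns 0 on success, or the negative errno value from the state device.
 */
int sketch_load_vmstate(const struct sketch_dev *devs, size_t n)
{
    assert(devs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int ret = devs[i].goto_result;

        if (ret < 0 && i == 0) {
            /* The quoted code does "return 0" here; report the failure. */
            return ret;
        }
        /* Failures on the other devices stay warnings, as in the quoted code. */
    }
    return 0;
}

/* Caller: stop the machine first, resume only if the load really worked. */
bool sketch_do_loadvm(const struct sketch_dev *devs, size_t n, bool *vm_running)
{
    *vm_running = false;                      /* vm_stop() equivalent */
    if (sketch_load_vmstate(devs, n) < 0) {
        return false;                         /* stay stopped */
    }
    *vm_running = true;                       /* vm_start() equivalent */
    return true;
}
```

With the state device failing, sketch_do_loadvm() returns false and
vm_running stays false; a failure on a secondary device alone does not
stop the load in this sketch, matching the "Warning: " branch above.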

>> As stated, I don't think that trying to run the machine at any point
>> would make any sense.  Only case where it is safe to run it is if the
>> failure is at get_bs_snapshots(), but at that point running the machine
>> means:
>
>  Actually, it must not pause the VM when recovery is (clearly) possible,
> otherwise it's a usability bug for the user Monitor and a possibly serious
> bug when you don't have human intervention (eg. QMP).

It is not possible: we have changed the device status from what it was
before.  All bets are off; we don't have a way to go back to the "safe state".

>> <something happens>
>> $ loadvm other_image
>>   Error "other_image" snapshot doesn't exist.
>> $
>> 
>> running the previous VM looks like something that should be done
>> explicitly.  If the error happened after get_bs_snapshots(),
>> we would need a new flag to just refuse to continue.  The only valid
>> operations at that point are other loadvm operations, i.e. our state is
>> wrong one way or another.
>
>  It's not clear to me how this flag can help, but anyway, what we need
> here is:
>
> 1. Fail when failure is reported (vs. report a failure and return OK)

This is a bug, plain and simple.

> 2. Don't keep the VM paused when recovery is possible
>
>  If you can fix that, it's ok to me: I'll drop this and the next patch.
>
>  Otherwise I'll have to insist on the split.

Re-read my email.   At this point, nothing is fixable :(  After doing
the 1st:

>>             ret = bdrv_snapshot_goto(bs1, name);

and not returning an error -> state has changed, period.  You can't
restart the machine.

If you prefer, you can change loadvm so that after a failure you
can't "cont" it until you get a "working" loadvm.
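
Something along these lines, as a sketch only.  None of these names exist
in QEMU: struct sketch_vm, loadvm_failed, sketch_loadvm() and sketch_cont()
are invented for illustration, and snapshot_ret stands in for the result of
the real snapshot load.  It also assumes the machine was running before the
loadvm, which the real command would have to check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical monitor state; none of these names exist in QEMU. */
struct sketch_vm {
    bool running;
    bool loadvm_failed;   /* a failed loadvm left the devices in an unknown state */
};

/* snapshot_ret stands in for the result of the real snapshot load. */
int sketch_loadvm(struct sketch_vm *vm, int snapshot_ret)
{
    assert(vm);

    vm->running = false;              /* the machine is stopped while loading */
    if (snapshot_ret < 0) {
        vm->loadvm_failed = true;     /* remember that the state is not usable */
        return snapshot_ret;
    }
    vm->loadvm_failed = false;        /* a working loadvm clears the flag */
    vm->running = true;
    return 0;
}

/* "cont" is refused until a loadvm has succeeded again. */
int sketch_cont(struct sketch_vm *vm)
{
    assert(vm);

    if (vm->loadvm_failed) {
        return -EINVAL;
    }
    vm->running = true;
    return 0;
}
```

So a failed loadvm followed by "cont" gets an error and the machine stays
stopped; only a later loadvm that works makes "cont" legal again.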

Later, Juan.