From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:57223)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <dgilbert@redhat.com>) id 1axsaL-0005nS-Lw
	for qemu-devel@nongnu.org; Wed, 04 May 2016 04:55:16 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <dgilbert@redhat.com>) id 1axsa8-0002JR-SK
	for qemu-devel@nongnu.org; Wed, 04 May 2016 04:55:04 -0400
Received: from mx1.redhat.com ([209.132.183.28]:50779)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <dgilbert@redhat.com>) id 1axsa8-0002E5-LH
	for qemu-devel@nongnu.org; Wed, 04 May 2016 04:54:56 -0400
Received: from int-mx09.intmail.prod.int.phx2.redhat.com
	(int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by mx1.redhat.com (Postfix) with ESMTPS id 4D27C7F08F
	for <qemu-devel@nongnu.org>; Wed,  4 May 2016 08:54:45 +0000 (UTC)
Date: Wed, 4 May 2016 09:54:41 +0100
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Message-ID: <20160504085440.GB2302@work-vm>
References: <1461903820-3092-1-git-send-email-eblake@redhat.com>
	<1461903820-3092-11-git-send-email-eblake@redhat.com>
	<20160503094447.GE2242@work-vm> <87inyv5nv7.fsf@dusky.pond.sub.org>
	<57289ABB.9080809@redhat.com> <20160503132757.GI2242@work-vm>
	<87vb2uz07l.fsf@dusky.pond.sub.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <87vb2uz07l.fsf@dusky.pond.sub.org>
Subject: Re: [Qemu-devel] [PATCH v3 10/18] vmstate: Use new JSON output
 visitor
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Markus Armbruster <armbru@redhat.com>
Cc: Eric Blake <eblake@redhat.com>, Amit Shah <amit.shah@redhat.com>, qemu-devel@nongnu.org, famz@redhat.com, Juan Quintela <quintela@redhat.com>

* Markus Armbruster (armbru@redhat.com) wrote:
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> 
> > * Eric Blake (eblake@redhat.com) wrote:
> >> On 05/03/2016 06:26 AM, Markus Armbruster wrote:
> >> 
> >> >>> +        visit_type_int(vmdesc, "size", &size, &error_abort);
> >> >>> +        visit_start_list(vmdesc, "fields", NULL, 0, &error_abort);
> >> >>> +        visit_start_struct(vmdesc, NULL, NULL, 0, &error_abort);
> >> >>
> >> >> Please avoid error_abort in migration code, especially on the source side.
> >> >> You've got an apparently happily working VM, we must never kill it 
> >> >> while attempting migration.
> >> > 
> >> > These functions cannot fail, and &error_abort is a concise way to
> >> > express that.  It's the same as
> >> > 
> >> >             visit_type_int(vmdesc, "size", &size, &err);
> >> >             assert(!err);
> >> 
> >> &error_abort is ONLY supposed to be used to flag programming errors (ie.
> >> they should never be reachable).  I'm asserting that the errors don't
> >> happen, and therefore this cannot make the migration fail - in other
> >> words, this is NOT going to kill a VM that attempts migration.
> >
> > OK, but remember that I work on the basis that there are programming errors
> > in both the migration code and the VMState descriptions for devices.
> > If those break it still shouldn't kill the source.
> > (Note this isn't just true of migration - we need to be careful about
> > it in all cases where we're doing stuff to an otherwise happy VM).
> 
> While you can safely recover from certain programming errors, you can't
> do it in general.  Worse, deciding whether recovery from a certain
> programming error is safe can be intractable.
> 
> Example: visit_type_enum(v, name, &enum_val, enum_str, &err), where v is
> an output visitor.  This can fail when enum_val is not a valid subscript
> of enum_str[].  Can we recover safely?  Assume that we can cleanly fail
> the task at hand at this point of its execution.
> 
> Perhaps enum_str[] doesn't match the actual enum.  This is a programming
> error.  Failing the task is graceful degradation, and safe enough.
> 
> But what if enum_str[] is fine, but enum_val got corrupted?  Then
> failing the task is still safe as long as enum_val isn't visible outside
> the task.  But if it is visible, all bets are off.  The corruption can
> spread, and do real damage.  Can be the difference between a crash that
> forces a reboot with a filesystem journal replay, and massive data
> corruption.
> 
> So, should we try to recover here?  Assuming we want to, badly.  If
> analysis shows the possible causes of this error are safely isolated by
> the recovery, yes.  Without such analysis, the only prudent answer is
> no.
> 
> Real world examples typically deal with state more complex than just an
> enum (all too often a thicket of pointers), and the safety argument gets
> much hairier.
> 
> If you want more tractable arguments, try Erlang.

And so my argument here is very simple; if we believe we have a corruption
in migration data then we fail migration - I don't try and do anything
clever about trying to bound what's broken.
This isn't about getting formal/tractable arguments, it's about making
a practical system.

> >> > * Conditions where the JSON output visitor itself sets an error:
> >> > 
> >> >   - None.
> >> 
> >> The JSON output visitor itself may be adding an error for an attempt to
> >> output Inf or NaN for a floating point number - but since vmstate
> >> doesn't use visit_type_number(), this is not possible.  And if we are
> >> really worried about it, then in my next spin of the patch I may make it
> >> user-configurable whether we stick to strict JSON or whether we relax
> >> things and output Inf/NaN anyways.
> >
> > If that's the only case, and you're already saying it doesn't use it, then
> > I don't see there's a point in making that bit any more configurable.
> 
> I listed all possible failures of the JSON output visitor upthread.
> This is an additional failure we've considered.  I'm wary of adding it
> precisely because I do worry about upsetting apple carts like this one.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK