From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:57223)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1axsaL-0005nS-Lw for qemu-devel@nongnu.org;
	Wed, 04 May 2016 04:55:16 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from ) id 1axsa8-0002JR-SK for qemu-devel@nongnu.org;
	Wed, 04 May 2016 04:55:04 -0400
Received: from mx1.redhat.com ([209.132.183.28]:50779)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1axsa8-0002E5-LH for qemu-devel@nongnu.org;
	Wed, 04 May 2016 04:54:56 -0400
Received: from int-mx09.intmail.prod.int.phx2.redhat.com
	(int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by mx1.redhat.com (Postfix) with ESMTPS id 4D27C7F08F for ;
	Wed, 4 May 2016 08:54:45 +0000 (UTC)
Date: Wed, 4 May 2016 09:54:41 +0100
From: "Dr. David Alan Gilbert"
Message-ID: <20160504085440.GB2302@work-vm>
References: <1461903820-3092-1-git-send-email-eblake@redhat.com>
	<1461903820-3092-11-git-send-email-eblake@redhat.com>
	<20160503094447.GE2242@work-vm>
	<87inyv5nv7.fsf@dusky.pond.sub.org>
	<57289ABB.9080809@redhat.com>
	<20160503132757.GI2242@work-vm>
	<87vb2uz07l.fsf@dusky.pond.sub.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <87vb2uz07l.fsf@dusky.pond.sub.org>
Subject: Re: [Qemu-devel] [PATCH v3 10/18] vmstate: Use new JSON output visitor
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: Markus Armbruster
Cc: Eric Blake , Amit Shah , qemu-devel@nongnu.org, famz@redhat.com, Juan Quintela

* Markus Armbruster (armbru@redhat.com) wrote:
> "Dr. David Alan Gilbert" writes:
>
> > * Eric Blake (eblake@redhat.com) wrote:
> >> On 05/03/2016 06:26 AM, Markus Armbruster wrote:
> >>
> >> >>> +    visit_type_int(vmdesc, "size", &size, &error_abort);
> >> >>> +    visit_start_list(vmdesc, "fields", NULL, 0, &error_abort);
> >> >>> +    visit_start_struct(vmdesc, NULL, NULL, 0, &error_abort);
> >> >>
> >> >> Please avoid error_abort in migration code, especially on the source side.
> >> >> You've got an apparently happily working VM, we must never kill it
> >> >> while attempting migration.
> >> >
> >> > These functions cannot fail, and &error_abort is a concise way to
> >> > express that. It's the same as
> >> >
> >> >     visit_type_int(vmdesc, "size", &size, &err);
> >> >     assert(!err);
> >>
> >> &error_abort is ONLY supposed to be used to flag programming errors (ie.
> >> they should never be reachable). I'm asserting that the errors don't
> >> happen, and therefore this cannot make the migration fail - in other
> >> words, this is NOT going to kill a VM that attempts migration.
> >
> > OK, but remember that I work on the basis that there are programming errors
> > in both the migration code and the VMState descriptions for devices.
> > If those break it still shouldn't kill the source.
> > (Note this isn't just true of migration - we need to be careful about
> > it in all cases where we're doing stuff to an otherwise happy VM).
>
> While you can safely recover from certain programming errors, you can't
> do it in general. Worse, deciding whether recovery from a certain
> programming error is safe can be intractable.
>
> Example: visit_type_enum(v, name, &enum_val, enum_str, &err), where v is
> an output visitor. This can fail when enum_val is not a valid subscript
> of enum_str[]. Can we recover safely? Assume that we can cleanly fail
> the task at hand at this point of its execution.
>
> Perhaps enum_str[] doesn't match the actual enum. This is a programming
> error. Failing the task is graceful degradation, and safe enough.
>
> But what if enum_str[] is fine, but enum_val got corrupted? Then
> failing the task is still safe as long as enum_val isn't visible outside
> the task. But if it is visible, all bets are off. The corruption can
> spread, and do real damage. Can be the difference between a crash that
> forces a reboot with a filesystem journal replay, and massive data
> corruption.
>
> So, should we try to recover here? Assuming we want to, badly. If
> analysis shows the possible causes of this error are safely isolated by
> the recovery, yes. Without such analysis, the only prudent answer is
> no.
>
> Real world examples typically deal with state more complex than just an
> enum (all too often a thicket of pointers), and the safety argument gets
> much hairier.
>
> If you want more tractable arguments, try Erlang.

And so my argument here is very simple; if we believe we have a corruption
in migration data then we fail migration - I don't try and do anything
clever about trying to bound what's broken.
This isn't about getting formal/tractable arguments, it's about making
a practical system.

> >> > * Conditions where the JSON output visitor itself sets an error:
> >> >
> >> >   - None.
> >>
> >> The JSON output visitor itself may be adding an error for an attempt to
> >> output Inf or NaN for a floating point number - but since vmstate
> >> doesn't use visit_type_number(), this is not possible. And if we are
> >> really worried about it, then in my next spin of the patch I may make it
> >> user-configurable whether we stick to strict JSON or whether we relax
> >> things and output Inf/NaN anyways.
> >
> > If that's the only case, and you're already saying it doesn't use it, then
> > I don't see there's a point in making that bit any more configurable.
>
> I listed all possible failures of the JSON output visitor upthread.
> This is an additional failure we've considered. I'm wary of adding it
> precisely because I do worry about upsetting apple carts like this one.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
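
For reference, the two error-handling styles argued over in this thread can be put side by side in a small standalone sketch. It deliberately does not use QEMU's visitor or Error API: emit_enum(), run_state_str[], describe_state_or_abort() and describe_state_or_fail() are hypothetical names invented for illustration, with emit_enum() standing in for the visit_type_enum() case where the value is not a valid subscript of the string table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for an output visitor's enum emitter; NOT QEMU's API. */
enum emit_status { EMIT_OK = 0, EMIT_BAD_ENUM = -1 };

/* Hypothetical string table, playing the role of enum_str[] in the thread. */
static const char *const run_state_str[] = { "running", "paused", "shutdown" };
#define RUN_STATE_MAX (sizeof(run_state_str) / sizeof(run_state_str[0]))

/* Copy the name for 'val' into 'buf'; fail if 'val' is not a valid subscript. */
static enum emit_status emit_enum(char *buf, size_t len, int val,
                                  const char *const *names, size_t n_names)
{
    if (val < 0 || (size_t)val >= n_names) {
        return EMIT_BAD_ENUM;
    }
    snprintf(buf, len, "%s", names[val]);
    return EMIT_OK;
}

/*
 * Style 1: treat any failure as a programming error and stop the process,
 * analogous to passing &error_abort (or asserting !err) in the quoted patch.
 */
static void describe_state_or_abort(char *buf, size_t len, int val)
{
    enum emit_status rc = emit_enum(buf, len, val, run_state_str, RUN_STATE_MAX);
    assert(rc == EMIT_OK);
    (void)rc; /* silence unused warning when NDEBUG disables assert */
}

/*
 * Style 2: fail only the task at hand. Returns 0 on success and -1 on
 * failure; the caller is expected to cancel the operation (here, the
 * migration) and leave everything else running.
 */
static int describe_state_or_fail(char *buf, size_t len, int val)
{
    if (emit_enum(buf, len, val, run_state_str, RUN_STATE_MAX) != EMIT_OK) {
        fprintf(stderr, "description failed: bad enum value %d\n", val);
        return -1;
    }
    return 0;
}
```

With a bad value, the first style terminates the whole process while the second only reports failure to its caller. Neither style answers the question raised above of whether a corrupted value is also visible outside the failed task; the sketch only shows what each choice does at the point of detection.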