From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mailman by lists.gnu.org with tmda-scanned (Exim 4.43)
	id 1MPQgT-0003gc-95
	for qemu-devel@nongnu.org; Fri, 10 Jul 2009 20:43:21 -0400
Received: from exim by lists.gnu.org with spam-scanned (Exim 4.43)
	id 1MPQgN-0003dv-No
	for qemu-devel@nongnu.org; Fri, 10 Jul 2009 20:43:19 -0400
Received: from [199.232.76.173] (port=44445 helo=monty-python.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43)
	id 1MPQgN-0003ds-I1
	for qemu-devel@nongnu.org; Fri, 10 Jul 2009 20:43:15 -0400
Received: from mx20.gnu.org ([199.232.41.8]:55872)
	by monty-python.gnu.org with esmtps (TLS-1.0:RSA_AES_256_CBC_SHA1:32)
	(Exim 4.60) (envelope-from )
	id 1MPQgN-0001C2-3A
	for qemu-devel@nongnu.org; Fri, 10 Jul 2009 20:43:15 -0400
Received: from mail2.shareable.org ([80.68.89.115])
	by mx20.gnu.org with esmtp (Exim 4.60) (envelope-from )
	id 1MPQgM-0000SS-9k
	for qemu-devel@nongnu.org; Fri, 10 Jul 2009 20:43:14 -0400
Date: Sat, 11 Jul 2009 01:42:58 +0100
From: Jamie Lokier
Subject: Re: [Qemu-devel] [PATCH 2/3] move vm stop/start to migrate_set_state
Message-ID: <20090711004258.GI30322@shareable.org>
References: <1247140059-5034-1-git-send-email-pbonzini@redhat.com>
	<1247140059-5034-3-git-send-email-pbonzini@redhat.com>
	<4A55F46F.6060705@codemonkey.ws>
	<4A55F510.5090801@redhat.com>
	<4A55F641.6000701@codemonkey.ws>
	<20090710231424.GD30322@shareable.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
List-Id: qemu-devel.nongnu.org
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: malc
Cc: Paolo Bonzini, qemu-devel@nongnu.org

malc wrote:
> > What happens if the destination host sends "migration completed", and
> > then the connection drops before that message is delivered reliably to
> > the sending host?
> >
> > The destination host will run the VM,
> > and the sending host will restart and run the VM too.
> >
> > Two copies of the same VM running together doesn't sound healthy.
> >
> > This is a classic handshaking problem and I'm not aware of any perfect
> > solution, only ways to ensure eventual recovery, and temporary
> > uncertainty errs on the side of caution. In this case, caution would
> > be neither VM running but a notification to the system manager of this
> > rare condition, and the possibility to recover when the two hosts are
> > able to resume communication. I don't know how to do better than that.
>
> Sounds like http://en.wikipedia.org/wiki/Two_Generals%27_Problem

It's not the same. Unlike the Two Generals, the handshake has outcomes
which allow progress with guaranteed safety. Two outcomes result in
one or other machine running, and a third outcome is both machines
being stopped, and repeatedly attempting to communicate for recovery.
Both machines stopped is undesirable (and may be a catastrophe for
some applications), but it is safe in some useful sense - it's not a
disastrous failure compared with both running.

Two Generals, on the other hand, doesn't have any safe solutions,
except for no progress at all. There is no way for either General to
proceed without some risk of failure, so the only strategy is to
minimise that probability.

-- Jamie

there is uncertainty

1. A sends "migration complete, you start running" to B, and A stops.
2. B sends "migration complete accepted" to A, and starts running.

If message 2 is lost, B will be running, A will be stopped, though A is
uncertain. A defers to the system operator, or keeps trying to
communicate with B.

If message 1 is lost, A -

>
> --
> mailto:av1474@comtv.ru
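
For illustration only, the numbered two-message handshake above can be
sketched as a tiny Python simulation that enumerates which messages get
lost. This is not QEMU code. The function names and the state dictionary
are hypothetical, and the sketch assumes that a host which is unsure of
the outcome stays stopped rather than resuming on its own (the notes
above break off before saying what A does when message 1 is lost).

```python
from itertools import product


def handshake(msg1_delivered: bool, msg2_delivered: bool) -> dict:
    """Simulate one run of the two-message handshake between A and B.

    Assumption (not confirmed by the thread): a host that is uncertain
    about the outcome stays stopped instead of resuming by itself.
    """
    # Step 1: A sends "migration complete, you start running" and stops.
    a_running = False
    a_certain = False
    b_running = False

    if msg1_delivered:
        # Step 2: B starts running and sends "migration complete accepted".
        b_running = True
        if msg2_delivered:
            # A received the acknowledgement and knows B has taken over.
            a_certain = True

    return {"a_running": a_running, "b_running": b_running, "a_certain": a_certain}


def all_outcomes() -> dict:
    """Enumerate every combination of delivered/lost messages."""
    return {
        (m1, m2): handshake(m1, m2)
        for m1, m2 in product([True, False], repeat=2)
    }


if __name__ == "__main__":
    for (m1, m2), state in all_outcomes().items():
        print(f"msg1 delivered={m1}, msg2 delivered={m2}: {state}")
```

Under that assumption no combination of lost messages leaves both hosts
running. The worst case is both stopped, with A uncertain and either
retrying or deferring to the operator.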