From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andrew Cooper
Subject: Re: [Patch v2] tools/migrate: Fix regression when migrating from older version of Xen
Date: Mon, 22 Jul 2013 19:06:39 +0100
Message-ID: <51ED74AF.9010605@citrix.com>
References: <1373971931-7211-1-git-send-email-andrew.cooper3@citrix.com>
 <51E8176D.5030809@citrix.com> <51ED7143.6090006@citrix.com>
 <1374515941.6623.30.camel@hastur.hellion.org.uk>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <1374515941.6623.30.camel@hastur.hellion.org.uk>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Ian Campbell
Cc: rshriram@cs.ubc.ca, Xen-devel
List-Id: xen-devel@lists.xenproject.org

On 22/07/13 18:59, Ian Campbell wrote:
> On Mon, 2013-07-22 at 18:52 +0100, Andrew Cooper wrote:
>> On 18/07/13 19:47, Shriram Rajagopalan wrote:
>>
>>> On Thu, Jul 18, 2013 at 12:27 PM, Andrew Cooper
>>> wrote:
>>> On 18/07/13 17:20, Shriram Rajagopalan wrote:
>>>
>>> > On Tue, Jul 16, 2013 at 6:52 AM, Andrew Cooper
>>> > wrote:
>>> > and you setting ctx->last_checkpoint = 0 basically means that you are
>>> > banking on the far end (with an older version of the tools) to
>>> > close the socket, causing "get pagebuf" to fail and subsequently land
>>> > on finish.  IIRC, this was the behavior before
>>> > XC_SAVE_ID_LAST_CHECKPOINT was introduced by Ian Campbell, to get rid
>>> > of this benign error message.
>>> >
>>>
>>> That might have been 'fine' for PV guests, but it caused
>>> active breakage for HVM domains, where the qemu save record
>>> immediately follows in the migration stream.
>>>
>>> Just to clarify, the code flows like this, IIRC:
>>>
>>> loadpages:
>>>   while (1)
>>>     if !completed
>>>       get pagebufs
>>>       if 0 pages sent, break
>>>     endif
>>>     apply batch (pagebufs)
>>>   endwhile
>>>
>>>   if !completed
>>>     get tailbuf  [[this is where the QEMU record would be obtained]]
>>>     completed = 1
>>>   endif
>>>
>>>   if last_checkpoint
>>>     goto finish
>>>   endif
>>>
>>>   get pagebuf or goto finish on error  ---> this is where old code used to exit
>>>   get tailbuf
>>>   goto loadpages
>>>
>>> finish:
>>>   apply tailbuf  [tailbuf obtained inside the 'if !completed' block]
>>>   do the rest of the restore
>>
>> (Logically joining the two divergent threads, as this is the answer to
>> both)
>>
>> This has nothing to do with the buffering mode, and that is not where
>> the Qemu record would be obtained from.
>>
>> As the code currently stands, if a XC_SAVE_ID_LAST_CHECKPOINT chunk is
>> not seen in the stream, we complete the loadpages: section, including
>> the magic pages (TSS, console info etc).
>>
>> We then call buffer_tail() on line 171, set ctx->completed on line
>> 1725, but fail the ctx->last_checkpoint check on line 1758.
>>
>> What we should do is pass the last_checkpoint test and goto finish,
>> which then calls dump_qemu().  What actually happens is a call to
>> pagebuf_get() on line 1766, which raises an error because it finds a
>> Qemu save record rather than more pages.
>>
>> So this is very much a bug directly introduced by 00a4b65f85, and can
>> only be fixed with knowledge from the higher levels of the toolstack.
>>
>> As for the wording of the parameter, I still prefer the original
>> "last_checkpoint_unaware" over "checkpointed_stream", as it is more
>> accurate.
>>
>> Any migration stream started from a version of the tools after c/s
>> 00a4b65f85 will work, whether it is checkpointed or not (as the
>> XC_SAVE_ID_LAST_CHECKPOINT chunk will be sent in the correct place).
>> Any migration started from a version of the tools before c/s
>> 00a4b65f85 will fail, because it has no idea that the receiving end is
>> expecting it to insert an XC_SAVE_ID_LAST_CHECKPOINT chunk.  The fault
>> here is with the receiving end expecting to find an
>> XC_SAVE_ID_LAST_CHECKPOINT chunk.
>>
>> The only fix is for the newer toolstack to be aware that the migration
>> stream is from an older toolstack, and to set
>> last_checkpoint_unaware=1, which will set ctx->last_checkpoint to 1,
>> allowing the receiving side of the migration to correctly read the
>> migration stream.
> The toolstack cannot in general know that (i.e. how does xl know?)
>
> It does know whether it is doing checkpointing or not, though.
>
> There are four cases:
>
> Regular migrate from pre-00a4b65f85 to post-00a4b65f85.  Receiver sets
> last_checkpoint = 1 at start of day.  Doesn't care that the sender never
> sends XC_SAVE_ID_LAST_CHECKPOINT.
>
> Regular migrate from post-00a4b65f85 to post-00a4b65f85.  Receiver sets
> last_checkpoint = 1 at start of day.  Doesn't care that it sees
> XC_SAVE_ID_LAST_CHECKPOINT again, just ends up (pointlessly) setting
> last_checkpoint = 1.
>
> Checkpointed migrate from post-00a4b65f85 to post-00a4b65f85.  Receiver
> doesn't set last_checkpoint = 1 at start of day.
> XC_SAVE_ID_LAST_CHECKPOINT is sent at the right time, both sides
> agree, and things work.
>
> Checkpointed migrate from pre-00a4b65f85 to post-00a4b65f85.  We don't
> support this, because checkpoints should only be supported between like
> versions of Xen (the N->N+1 case doesn't apply).
>
> Ian.
>
>

Ok - that seems to make it easier.  I will see about plumbing the
correct information through from xend/remus.

~Andrew
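
A minimal, self-contained C sketch of the receiver-side flow discussed in
this thread.  This is not the real xc_domain_restore.c: struct restore_ctx,
get_pagebufs(), apply_batch(), buffer_tail() and dump_qemu() are illustrative
stubs, and only the shape of the logic follows the pseudocode quoted above.
The point it shows is the one being argued: if the caller knows the stream is
not checkpointed (a plain migrate, possibly from a pre-00a4b65f85 sender) and
asserts "last checkpoint" up front, the restore path goes straight to finish
after the tailbuf and consumes the qemu record, instead of calling
pagebuf_get() again and failing on it.

/* Sketch only -- not the real libxc restore code. */
#include <stdbool.h>
#include <stdio.h>

struct restore_ctx {
    bool completed;        /* tailbuf has been read */
    bool last_checkpoint;  /* no further checkpoints will follow */
};

/* Illustrative stubs for the real stream-reading helpers. */
static bool get_pagebufs(void) { return false; } /* false => zero-page batch */
static void apply_batch(void)  { }
static void buffer_tail(void)  { }
static void dump_qemu(void)    { puts("qemu save record consumed at finish"); }

static void restore(struct restore_ctx *ctx)
{
    bool have_batch = false;  /* a pagebuf was fetched before goto loadpages */

loadpages:
    for (;;) {
        if (!have_batch && !get_pagebufs())
            break;                     /* zero-page batch: end of this pass */
        have_batch = false;
        apply_batch();
    }

    if (!ctx->completed) {
        buffer_tail();                 /* the qemu record lives in the tailbuf */
        ctx->completed = true;
    }

    if (ctx->last_checkpoint)
        goto finish;                   /* non-checkpointed stream: nothing more */

    /* Checkpointed stream: expect another checkpoint's worth of pages.
     * A pre-00a4b65f85 sender never says "last checkpoint", so without the
     * up-front flag a plain migrate lands here and chokes on the qemu record. */
    if (!get_pagebufs())
        goto finish;
    have_batch = true;
    goto loadpages;

finish:
    dump_qemu();
}

int main(void)
{
    /* The caller (xl/xend/remus) knows whether this is a checkpointed
     * stream; for a plain migrate it asserts "last checkpoint" up front. */
    struct restore_ctx ctx = { .completed = false, .last_checkpoint = true };
    restore(&ctx);
    return 0;
}

Builds with, e.g., "gcc -Wall sketch.c".  Flipping last_checkpoint to false
in main() models the checkpointed case, where the receiver keeps looping
until the stream actually ends.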