From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:51354)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1ZCmzU-0008D4-OJ for qemu-devel@nongnu.org;
	Wed, 08 Jul 2015 06:54:13 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from ) id 1ZCmzP-00050W-4g for qemu-devel@nongnu.org;
	Wed, 08 Jul 2015 06:54:12 -0400
Received: from e06smtp16.uk.ibm.com ([195.75.94.112]:46085)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1ZCmzO-00050J-Rm for qemu-devel@nongnu.org;
	Wed, 08 Jul 2015 06:54:07 -0400
Received: from /spool/local by e06smtp16.uk.ibm.com with IBM ESMTP SMTP Gateway:
	Authorized Use Only! Violators will be prosecuted for from ;
	Wed, 8 Jul 2015 11:54:05 +0100
Received: from b06cxnps4075.portsmouth.uk.ibm.com
	(d06relay12.portsmouth.uk.ibm.com [9.149.109.197])
	by d06dlp01.portsmouth.uk.ibm.com (Postfix) with ESMTP id B5A6617D806A
	for ; Wed, 8 Jul 2015 11:55:17 +0100 (BST)
Received: from d06av09.portsmouth.uk.ibm.com
	(d06av09.portsmouth.uk.ibm.com [9.149.37.250])
	by b06cxnps4075.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0)
	with ESMTP id t68As1uq30670866 for ; Wed, 8 Jul 2015 10:54:01 GMT
Received: from d06av09.portsmouth.uk.ibm.com (localhost [127.0.0.1])
	by d06av09.portsmouth.uk.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout)
	with ESMTP id t68As1C7020590 for ; Wed, 8 Jul 2015 04:54:01 -0600
Message-ID: <559D0149.3070901@de.ibm.com>
Date: Wed, 08 Jul 2015 12:54:01 +0200
From: Christian Borntraeger
MIME-Version: 1.0
References: <1436274549-28826-1-git-send-email-quintela@redhat.com>
	<1436274549-28826-16-git-send-email-quintela@redhat.com>
	<559CF767.3060000@de.ibm.com> <20150708101415.GD2463@work-vm>
	<559CFD2D.2040806@de.ibm.com> <20150708104326.GE2463@work-vm>
In-Reply-To: <20150708104326.GE2463@work-vm>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PULL 15/28] migration: create new section to store global state
To: "Dr. David Alan Gilbert"
Cc: amit.shah@redhat.com, Cornelia Huck, qemu-devel@nongnu.org, Juan Quintela

On 08.07.2015 at 12:43, Dr. David Alan Gilbert wrote:
> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>> On 08.07.2015 at 12:14, Dr. David Alan Gilbert wrote:
>>> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>>>> On 07.07.2015 at 15:08, Juan Quintela wrote:
>>>>> This includes a new section that for now just stores the current qemu state.
>>>>>
>>>>> Right now, there is only one way to control what the state of the
>>>>> target is after migration.
>>>>>
>>>>> - If you run the target qemu with -S, it would start stopped.
>>>>> - If you run the target qemu without -S, it would run just after migration finishes.
>>>>>
>>>>> The problem here is what happens if we start the target without -S and
>>>>> an error happens during migration that puts the current state as
>>>>> -EIO. Migration would end (notice that the error happened doing block
>>>>> IO, network IO, i.e. nothing related to migration), and when
>>>>> migration finishes, we would just "continue" running on the destination,
>>>>> probably hanging the guest, corrupting data, whatever.
>>>>>
>>>>> Signed-off-by: Juan Quintela
>>>>> Reviewed-by: Dr. David Alan Gilbert
>>>>
>>>> This is bisected to cause a regression on s390.
>>>>
>>>> A guest restarts (booting) after managedsave/start instead of continuing.
>>>>
>>>> Do you have any idea what might be wrong?
>>>
>>> I'd add some debug to the pre_save and post_load to see what state value is
>>> being saved/restored.
>>>
>>> Also, does that regression happen when doing the save/restore using the same/latest
>>> git, or is it a load from an older version?
>>
>> Seems to happen only with some guest definitions, but I can't really pinpoint it yet.
>> E.g. removing queues='4' from my network card solved it for a reduced xml, but
>> doing the same on a bigger xml was not enough :-/
>
> Nasty; still, the 'paused' value in the pre-save/post-load feels right.
> I've read through the patch again and it still feels right to me, so I don't
> see anything obvious.
>
> Perhaps it's worth turning on the migration tracing on both sides and seeing what's
> different with that 'queues=4'?

Reducing the number of virtio disks also seems to help. I wonder whether
some devices use the runstate somehow and this change triggers a race.
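
To make the behaviour under discussion concrete, here is a minimal Python sketch of what the quoted commit message describes. It is not QEMU code. `GlobalState`, `migrate`, the `started_with_S` parameter and the run-state strings are hypothetical names chosen for illustration; only the idea comes from the thread: the source records its run state in a section, and the destination restores that state instead of unconditionally continuing. The print calls stand in for the debug output suggested for pre_save and post_load.

```python
from dataclasses import dataclass

# Hypothetical run-state names; the real set in QEMU is larger.
RUNNING = "running"
PAUSED = "paused"
IO_ERROR = "io-error"


@dataclass
class GlobalState:
    """Toy stand-in for the new migration section (assumption, not QEMU's struct)."""
    runstate: str = ""

    def pre_save(self, current_runstate: str) -> None:
        # Source side: remember which state the guest was in.
        self.runstate = current_runstate
        print(f"pre_save: saving runstate={self.runstate!r}")

    def post_load(self, started_with_S: bool) -> str:
        # Destination side: choose the state to enter after migration.
        print(f"post_load: loaded runstate={self.runstate!r}")
        if self.runstate != RUNNING:
            # The source was not running (e.g. an I/O error): keep that
            # state rather than blindly continuing the guest.
            return self.runstate
        # The source was running: honour -S on the destination.
        return PAUSED if started_with_S else RUNNING


def migrate(source_runstate: str, started_with_S: bool = False) -> str:
    """Simulate one save/transfer/load cycle and return the destination state."""
    sent = GlobalState()
    sent.pre_save(source_runstate)
    received = GlobalState(runstate=sent.runstate)  # "transfer" the section
    return received.post_load(started_with_S)
```

With this model, a guest that was running on the source keeps running on a destination started without -S, while a guest that hit an I/O error stays in that state. Comparing the two printed values on both sides is the kind of check that would show whether the saved state is the one being restored.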