From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:51354)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <borntraeger@de.ibm.com>) id 1ZCmzU-0008D4-OJ
	for qemu-devel@nongnu.org; Wed, 08 Jul 2015 06:54:13 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <borntraeger@de.ibm.com>) id 1ZCmzP-00050W-4g
	for qemu-devel@nongnu.org; Wed, 08 Jul 2015 06:54:12 -0400
Received: from e06smtp16.uk.ibm.com ([195.75.94.112]:46085)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <borntraeger@de.ibm.com>) id 1ZCmzO-00050J-Rm
	for qemu-devel@nongnu.org; Wed, 08 Jul 2015 06:54:07 -0400
Received: from /spool/local
	by e06smtp16.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use
	Only! Violators will be prosecuted
	for <qemu-devel@nongnu.org> from <borntraeger@de.ibm.com>;
	Wed, 8 Jul 2015 11:54:05 +0100
Received: from b06cxnps4075.portsmouth.uk.ibm.com
	(d06relay12.portsmouth.uk.ibm.com [9.149.109.197])
	by d06dlp01.portsmouth.uk.ibm.com (Postfix) with ESMTP id B5A6617D806A
	for <qemu-devel@nongnu.org>; Wed,  8 Jul 2015 11:55:17 +0100 (BST)
Received: from d06av09.portsmouth.uk.ibm.com (d06av09.portsmouth.uk.ibm.com
	[9.149.37.250])
	by b06cxnps4075.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with
	ESMTP id t68As1uq30670866
	for <qemu-devel@nongnu.org>; Wed, 8 Jul 2015 10:54:01 GMT
Received: from d06av09.portsmouth.uk.ibm.com (localhost [127.0.0.1])
	by d06av09.portsmouth.uk.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with
	ESMTP id t68As1C7020590
	for <qemu-devel@nongnu.org>; Wed, 8 Jul 2015 04:54:01 -0600
Message-ID: <559D0149.3070901@de.ibm.com>
Date: Wed, 08 Jul 2015 12:54:01 +0200
From: Christian Borntraeger <borntraeger@de.ibm.com>
MIME-Version: 1.0
References: <1436274549-28826-1-git-send-email-quintela@redhat.com>
	<1436274549-28826-16-git-send-email-quintela@redhat.com>
	<559CF767.3060000@de.ibm.com> <20150708101415.GD2463@work-vm>
	<559CFD2D.2040806@de.ibm.com> <20150708104326.GE2463@work-vm>
In-Reply-To: <20150708104326.GE2463@work-vm>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PULL 15/28] migration: create new section to
 store global state
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: amit.shah@redhat.com, Cornelia Huck <cornelia.huck@de.ibm.com>, qemu-devel@nongnu.org, Juan Quintela <quintela@redhat.com>

On 08.07.2015 at 12:43, Dr. David Alan Gilbert wrote:
> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>> On 08.07.2015 at 12:14, Dr. David Alan Gilbert wrote:
>>> * Christian Borntraeger (borntraeger@de.ibm.com) wrote:
>>>> On 07.07.2015 at 15:08, Juan Quintela wrote:
>>>>> This includes a new section that for now just stores the current qemu state.
>>>>>
>>>>> Right now, there is only one way to control the state of the
>>>>> target after migration.
>>>>>
>>>>> - If you run the target qemu with -S, it would start stopped.
>>>>> - If you run the target qemu without -S, it would run just after migration finishes.
>>>>>
>>>>> The problem here is what happens if we start the target without -S and
>>>>> an error occurs during migration that sets the current state to
>>>>> -EIO.  Migration would end (note that the error happened doing block
>>>>> IO, network IO, i.e. nothing related to migration), and when
>>>>> migration finishes, we would just "continue" running on the destination,
>>>>> probably hanging the guest, corrupting data, whatever.
>>>>>
>>>>> Signed-off-by: Juan Quintela <quintela@redhat.com>
>>>>> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
>>>>
>>>> This is bisected to cause a regression on s390.
>>>>
>>>> A guest restarts (booting) after managedsave/start instead of continuing.
>>>>
>>>> Do you have any idea what might be wrong?
>>>
>>> I'd add some debug to the pre_save and post_load to see what state value is
>>> being saved/restored.
>>>
>>> Also, does that regression happen when doing the save/restore using the same/latest
>>> git, or is it a load from an older version?
>>
>> Seems to happen only with some guest definitions, but I can't really pinpoint it yet.
>> e.g. removing queues='4' from my network card solved it for a reduced xml, but
>> doing the same on a bigger xml was not enough :-/
> 
> Nasty;  Still the 'paused' value in the pre-save/post-load feels right.
> I've read through the patch again and it still feels right to me, so I don't
> see anything obvious.
> 
> Perhaps it's worth turning on the migration tracing on both sides and seeing what's
> different with that 'queues=4' ?

Reducing the number of virtio disks also seems to help. I am wondering whether
some devices use the runstate somehow and this change triggers a race.
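
For reference, a minimal sketch of the kind of debug output suggested above
for pre_save and post_load. This is not the code from the patch: the struct,
the function names, the buffer size and the list of accepted state names are
all hypothetical stand-ins, and it only models a section that carries the run
state as a string.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the migrated global-state section: it only
 * carries the run state as a fixed-size, NUL-terminated string. */
typedef struct {
    char runstate[100];
} DemoGlobalState;

/* Hypothetical pre_save hook: record the current run state and log it,
 * so the source side shows exactly which value goes into the stream. */
static int demo_pre_save(DemoGlobalState *s, const char *current_runstate)
{
    snprintf(s->runstate, sizeof(s->runstate), "%s", current_runstate);
    fprintf(stderr, "pre_save: saving runstate '%s'\n", s->runstate);
    return 0;
}

/* Hypothetical post_load hook: log what arrived on the destination and
 * reject values that are not in the (assumed) list of known states. */
static int demo_post_load(const DemoGlobalState *s)
{
    static const char *const known[] = { "running", "paused" };
    size_t i;

    fprintf(stderr, "post_load: loaded runstate '%s'\n", s->runstate);
    for (i = 0; i < sizeof(known) / sizeof(known[0]); i++) {
        if (strcmp(s->runstate, known[i]) == 0) {
            return 0;
        }
    }
    return -1; /* unknown state name: fail the load */
}
```

Comparing the two log lines for a working and a failing guest definition
should show whether the value itself changes, or whether the state is
transferred correctly and something else reacts to it.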