From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <5513793B.6020909@cn.fujitsu.com>
Date: Thu, 26 Mar 2015 11:12:59 +0800
From: Wen Congyang
To: quintela@redhat.com, zhanghailiang
Cc: hangaohuai@huawei.com, Li Zhijian, qemu-devel@nongnu.org, peter.huangpeng@huawei.com, "Gonglei (Arei)", Amit Shah, "Dr. David Alan Gilbert (git)", david@gibson.dropbear.id.au
Subject: Re: [Qemu-devel] [Migration Bug? ] Occasionally, the content of VM's memory is inconsistent between Source and Destination of migration
References: <55128084.2040304@huawei.com> <87a8z12yot.fsf@neno.neno>
In-Reply-To: <87a8z12yot.fsf@neno.neno>
MIME-Version: 1.0
Content-Type: text/plain; charset="windows-1252"
Content-Transfer-Encoding: 7bit

On 03/25/2015 05:50 PM, Juan Quintela wrote:
> zhanghailiang wrote:
>> Hi all,
>>
>> We found that, sometimes, the content of the VM's memory is inconsistent between the source and destination sides
>> when we check it just after finishing migration but before the VM continues to run.
>>
>> We use a patch like the one below to find this issue (you can find it in the attachment). Steps to reproduce:
>>
>> (1) Compile QEMU:
>> ./configure --target-list=x86_64-softmmu --extra-ldflags="-lssl" && make
>>
>> (2) Command and output:
>> SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock -netdev tap,id=hn0 -device virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio
>
> Could you try to reproduce:
> - without vhost
> - without virtio-net
> - cache=unsafe is going to give you trouble, but trouble should only
>   happen after migration of pages has finished.

If I use an IDE disk, it doesn't happen.
Even if I use virtio-net with vhost=on, it still doesn't happen. I guess that is because I migrate the guest while it is booting, so the virtio-net device is not used in this case.

Thanks
Wen Congyang

> What kind of load were you having when reproducing this issue?
> Just to confirm, you have been able to reproduce this without COLO
> patches, right?
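For context, the md_host digest lines in the quoted output below come from a debugging hook of roughly the following shape. This is only a minimal standalone sketch, not the actual patch from the attachment: it assumes OpenSSL's MD5 API (hence the --extra-ldflags="-lssl" in the configure line above), and the block array passed in is just an illustrative stand-in for QEMU's real list of RAM blocks.

/*
 * Hash all guest RAM and print the digest at a known point in the
 * migration, e.g. before/after saving RAM on the source and after
 * loading RAM / device state on the destination.  Matching digests on
 * both sides mean the RAM contents were identical at that point.
 */
#include <stdio.h>
#include <openssl/md5.h>

static void md_host(const char *tag, int nblocks,
                    unsigned char *host[], size_t len[])
{
    MD5_CTX ctx;
    unsigned char digest[MD5_DIGEST_LENGTH];
    int i, j;

    MD5_Init(&ctx);
    for (i = 0; i < nblocks; i++) {
        MD5_Update(&ctx, host[i], len[i]);      /* hash each RAM block */
    }
    MD5_Final(digest, &ctx);

    printf("md_host : %s\n", tag);
    for (j = 0; j < MD5_DIGEST_LENGTH; j++) {
        printf("%02x", digest[j]);              /* 32 hex chars, as in the log */
    }
    printf("\n");
}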
>
>> (qemu) migrate tcp:192.168.3.8:3004
>> before saving ram complete
>> ff703f6889ab8701e4e040872d079a28
>> md_host : after saving ram complete
>> ff703f6889ab8701e4e040872d079a28
>>
>> DST: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu qemu64,-kvmclock -netdev tap,id=hn0,vhost=on -device virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -incoming tcp:0:3004
>> (qemu) QEMU_VM_SECTION_END, after loading ram
>> 230e1e68ece9cd4e769630e1bcb5ddfb
>> md_host : after loading all vmstate
>> 230e1e68ece9cd4e769630e1bcb5ddfb
>> md_host : after cpu_synchronize_all_post_init
>> 230e1e68ece9cd4e769630e1bcb5ddfb
>>
>> This happens occasionally, and it is easier to reproduce when the migration command is issued during the VM's startup.
>
> OK, a couple of things. Memory doesn't have to be exactly identical.
> Virtio devices in particular do funny things on "post-load". There
> are no guarantees for that as far as I know; we should end up with an
> equivalent device state in memory.
>
>> We have done further tests and found that some pages have been dirtied but their corresponding migration_bitmap bits are not set.
>> We can't figure out which module of QEMU misses setting the bitmap when dirtying the VM's pages;
>> it is very difficult for us to trace all the actions that dirty the VM's pages.
>
> This seems to point to a bug in one of the devices.
>
>> Actually, the first time we found this problem was during COLO FT development, and it triggered some strange issues in the
>> VM which all pointed to inconsistent VM memory. (If we save all of the VM's memory to the slave side every time
>> we do a checkpoint in COLO FT, everything is OK.)
>>
>> Is it OK for some pages not to be transferred to the destination during migration? Or is it a bug?
>
> Pages transferred should be the same; after device state transmission is
> when things could change.
>
>> This issue has blocked our COLO development... :(
>>
>> Any help will be greatly appreciated!
>
> Later, Juan.
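To make the dirty-bitmap part of the discussion concrete: conceptually, the migration bitmap works like the sketch below. This is a simplified illustration, not QEMU's actual migration_bitmap code; the page size, guest size and helper names are made up for the example. There is one bit per guest page, set by whatever code writes guest memory, then tested and cleared by the migration thread when it re-sends the page. A path that writes guest memory without setting its bit would leave a page dirty on the source but never re-transferred, which matches the symptom described above.

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT     12                   /* 4 KiB pages */
#define GUEST_RAM      (2048ULL << 20)      /* 2 GB guest, as in the test */
#define GUEST_PAGES    (GUEST_RAM >> PAGE_SHIFT)
#define BITS_PER_LONG  (8 * sizeof(unsigned long))

static unsigned long dirty_bitmap[GUEST_PAGES / BITS_PER_LONG + 1];

/* Every writer of guest memory is supposed to mark the page dirty... */
static void mark_dirty(uint64_t addr)
{
    uint64_t page = addr >> PAGE_SHIFT;

    dirty_bitmap[page / BITS_PER_LONG] |= 1UL << (page % BITS_PER_LONG);
}

/* ...and the migration thread re-sends a page iff its bit was set. */
static bool test_and_clear_dirty(uint64_t page)
{
    unsigned long mask = 1UL << (page % BITS_PER_LONG);
    bool was_dirty = dirty_bitmap[page / BITS_PER_LONG] & mask;

    dirty_bitmap[page / BITS_PER_LONG] &= ~mask;
    return was_dirty;
}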