Date: Thu, 26 Mar 2015 12:57:38 +0100
From: "Michael S. Tsirkin"
Message-ID: <20150326125708-mutt-send-email-mst@redhat.com>
References: <55128084.2040304@huawei.com> <87a8z12yot.fsf@neno.neno>
 <5513793B.6020909@cn.fujitsu.com> <874mp8xd9k.fsf@neno.neno>
In-Reply-To: <874mp8xd9k.fsf@neno.neno>
Subject: Re: [Qemu-devel] [Migration Bug?] Occasionally, the content of VM's
 memory is inconsistent between Source and Destination of migration
To: Juan Quintela
Cc: Kevin Wolf, hangaohuai@huawei.com, zhanghailiang, Li Zhijian,
 qemu-devel@nongnu.org, "Dr. David Alan Gilbert (git)", "Gonglei (Arei)",
 Stefan Hajnoczi, Amit Shah, peter.huangpeng@huawei.com,
 david@gibson.dropbear.id.au

On Thu, Mar 26, 2015 at 11:29:43AM +0100, Juan Quintela wrote:
> Wen Congyang wrote:
> > On 03/25/2015 05:50 PM, Juan Quintela wrote:
> >> zhanghailiang wrote:
> >>> Hi all,
> >>>
> >>> We found that, sometimes, the content of the VM's memory is
> >>> inconsistent between the Source side and the Destination side when we
> >>> check it just after finishing migration but before the VM continues
> >>> to run.
> >>>
> >>> We use a patch like the one below to find this issue; you can find it
> >>> in the attachment. Steps to reproduce:
> >>>
> >>> (1) Compile QEMU:
> >>> ./configure --target-list=x86_64-softmmu --extra-ldflags="-lssl" && make
> >>>
> >>> (2) Command and output:
> >>> SRC: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
> >>> qemu64,-kvmclock -netdev tap,id=hn0 -device
> >>> virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
> >>> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
> >>> -device
> >>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> >>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
> >>> -monitor stdio
> >>
> >> Could you try to reproduce:
> >> - without vhost
> >> - without virtio-net
> >> - cache=unsafe is going to give you trouble, but trouble should only
> >>   happen after the migration of pages has finished.
> >
> > If I use an IDE disk, it doesn't happen.
> > Even if I use virtio-net with vhost=on, it still doesn't happen. I
> > guess that is because I migrate the guest while it is booting; the
> > virtio net device is not used in this case.
>
> Kevin, Stefan, Michael, any great ideas?
>
> Thanks, Juan.

If this is during boot from disk, we can more or less rule out
virtio-net/vhost-net.

> >
> > Thanks
> > Wen Congyang
> >
> >>
> >> What kind of load were you running when reproducing this issue?
> >> Just to confirm, you have been able to reproduce this without the
> >> COLO patches, right?
> >>
> >>> (qemu) migrate tcp:192.168.3.8:3004
> >>> before saving ram complete
> >>> ff703f6889ab8701e4e040872d079a28
> >>> md_host : after saving ram complete
> >>> ff703f6889ab8701e4e040872d079a28
> >>>
> >>> DST: # x86_64-softmmu/qemu-system-x86_64 -enable-kvm -cpu
> >>> qemu64,-kvmclock -netdev tap,id=hn0,vhost=on -device
> >>> virtio-net-pci,id=net-pci0,netdev=hn0 -boot c -drive
> >>> file=/mnt/sdb/pure_IMG/sles/sles11_sp3.img,if=none,id=drive-virtio-disk0,cache=unsafe
> >>> -device
> >>> virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
> >>> -vnc :7 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet
> >>> -monitor stdio -incoming tcp:0:3004
> >>> (qemu) QEMU_VM_SECTION_END, after loading ram
> >>> 230e1e68ece9cd4e769630e1bcb5ddfb
> >>> md_host : after loading all vmstate
> >>> 230e1e68ece9cd4e769630e1bcb5ddfb
> >>> md_host : after cpu_synchronize_all_post_init
> >>> 230e1e68ece9cd4e769630e1bcb5ddfb
> >>>
> >>> This happens occasionally, and it is easier to reproduce when the
> >>> migration command is issued during the VM's startup.
> >>
> >> OK, a couple of things. Memory doesn't have to be exactly identical.
> >> Virtio devices in particular do funny things on "post-load". There
> >> are no guarantees for that as far as I know; we should end up with an
> >> equivalent device state in memory.
> >>
> >>> We have done further tests and found that some pages have been
> >>> dirtied but their corresponding migration_bitmap bits are not set.
> >>> We can't figure out which module of QEMU misses setting the bitmap
> >>> when a page of the VM is dirtied; it is very difficult for us to
> >>> trace all the actions that dirty the VM's pages.
> >>
> >> This seems to point to a bug in one of the devices.
> >>
> >>> Actually, the first time we found this problem was during COLO FT
> >>> development, and it triggered some strange issues in the VM which
> >>> all pointed to inconsistency of the VM's memory. (We have tried
> >>> saving all of the VM's memory to the slave side every time we do a
> >>> checkpoint in COLO FT, and then everything is OK.)
> >>>
> >>> Is it OK for some pages not to be transferred to the destination
> >>> when doing migration? Or is it a bug?
> >>
> >> Pages transferred should be the same; it is after device state
> >> transmission that things could change.
> >>
> >>> This issue has blocked our COLO development... :(
> >>>
> >>> Any help will be greatly appreciated!
> >>
> >> Later, Juan.
> >>
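
For reference, a minimal, self-contained sketch of the RAM-checksum idea
zhanghailiang describes above: hash guest memory at fixed points on both
sides and compare the digests. This is not the attached patch; the buffer
and function names below are illustrative stand-ins rather than QEMU's
actual migration hooks, and the only dependency is glib, which QEMU already
links against.

/*
 * ram-md5-demo.c - illustrative only, not the patch from the attachment.
 * Build: gcc ram-md5-demo.c $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <stdio.h>
#include <string.h>

/* Return the MD5 digest of a memory range as a newly allocated hex string. */
static gchar *ram_md5(const void *host_addr, gsize len)
{
    GChecksum *cs = g_checksum_new(G_CHECKSUM_MD5);
    g_checksum_update(cs, (const guchar *)host_addr, len);
    gchar *hex = g_strdup(g_checksum_get_string(cs));
    g_checksum_free(cs);
    return hex;
}

int main(void)
{
    /* Stand-in for guest RAM: two 4 KiB "pages". */
    guchar ram[2 * 4096];
    memset(ram, 0xab, sizeof(ram));

    /* Digest at the first checkpoint ("before saving ram complete"). */
    gchar *before = ram_md5(ram, sizeof(ram));

    /* A late guest write; if its dirty bit were lost, this byte would
     * never be re-sent, and the destination's digest would differ. */
    ram[4096 + 42] ^= 0xff;

    /* Digest at the second checkpoint ("after saving ram complete"). */
    gchar *after = ram_md5(ram, sizeof(ram));

    printf("md_host : before %s\n", before);
    printf("md_host : after  %s\n", after);
    printf("%s\n", strcmp(before, after) == 0 ? "consistent" : "INCONSISTENT");

    g_free(before);
    g_free(after);
    return 0;
}

Judging by the output above, the attached patch computes such a digest
before and after saving RAM on the source, and after loading RAM and after
cpu_synchronize_all_post_init on the destination; matching digests mean the
transferred pages agree, while a mismatch points at a page that changed
without being re-sent.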
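
To make the "dirtied but bitmap not set" hypothesis concrete: migration
tracks one dirty bit per guest page, so any code path that writes guest
memory without also setting that page's bit produces exactly the symptom
seen here, a page whose new contents are never re-sent. Schematically
(simplified; names and layout are illustrative, not QEMU's actual bitmap
code):

#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_BITS     12   /* 4 KiB target pages */
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Mark the page containing guest address 'addr' as dirty.  Every path
 * that modifies guest memory must reach this (or its equivalent);
 * a path that skips it is the suspected bug. */
static void dirty_bitmap_set_page(unsigned long *bitmap, uint64_t addr)
{
    uint64_t page = addr >> PAGE_BITS;
    bitmap[page / BITS_PER_LONG] |= 1UL << (page % BITS_PER_LONG);
}

/* The migration loop only re-sends pages whose bit tests true. */
static bool dirty_bitmap_test_page(const unsigned long *bitmap, uint64_t addr)
{
    uint64_t page = addr >> PAGE_BITS;
    return bitmap[page / BITS_PER_LONG] & (1UL << (page % BITS_PER_LONG));
}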