Date: Tue, 24 May 2016 10:12:44 +0800
From: Fam Zheng
To: "Jason J. Herne"
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org, stefanha@redhat.com,
    jcody@redhat.com, quintela@redhat.com
Subject: Re: [Qemu-devel] coroutines: block: Co-routine re-entered recursively
    when migrating disk with iothreads
Message-ID: <20160524021244.GD14601@ad.usersys.redhat.com>
In-Reply-To: <574351CE.8000605@linux.vnet.ibm.com>
References: <574351CE.8000605@linux.vnet.ibm.com>

On Mon, 05/23 14:54, Jason J. Herne wrote:
> Using libvirt to migrate a guest and one guest disk that is using
> iothreads causes Qemu to crash with the message:
>
>     Co-routine re-entered recursively
>
> I've looked into this one a bit, but I have not seen anything that
> immediately stands out. Here is what I have found.
>
> In qemu_coroutine_enter:
>
>     if (co->caller) {
>         fprintf(stderr, "Co-routine re-entered recursively\n");
>         abort();
>     }
>
> The value of co->caller is actually changing between the time "if
> (co->caller)" is evaluated and the time I print some debug statements
> directly under the existing fprintf. I confirmed this by saving the
> value in a local variable and printing both the new local variable and
> co->caller immediately after the existing fprintf. This certainly
> indicates some kind of concurrency issue. However, it does not
> necessarily point to the reason we ended up inside this if statement,
> because co->caller was not NULL before it was trashed. Perhaps it was
> trashed more than once, then? I figured maybe the problem was with
> coroutine pools, so I disabled them (--disable-coroutine-pool) and
> still hit the bug.

Which coroutine backend are you using?

>
> The backtrace is not always identical. Here is one instance:
>
> (gdb) bt
> #0  0x000003ffa78be2c0 in raise () from /lib64/libc.so.6
> #1  0x000003ffa78bfc26 in abort () from /lib64/libc.so.6
> #2  0x0000000080427d80 in qemu_coroutine_enter (co=0xa2cf2b40, opaque=0x0)
>     at /root/kvmdev/qemu/util/qemu-coroutine.c:112
> #3  0x000000008032246e in nbd_restart_write (opaque=0xa2d0cd40)
>     at /root/kvmdev/qemu/block/nbd-client.c:114
> #4  0x00000000802b3a1c in aio_dispatch (ctx=0xa2c907a0)
>     at /root/kvmdev/qemu/aio-posix.c:341
> #5  0x00000000802b4332 in aio_poll (ctx=0xa2c907a0, blocking=true)
>     at /root/kvmdev/qemu/aio-posix.c:479
> #6  0x0000000080155aba in iothread_run (opaque=0xa2c90260)
>     at /root/kvmdev/qemu/iothread.c:46
> #7  0x000003ffa7a87c2c in start_thread () from /lib64/libpthread.so.0
> #8  0x000003ffa798ec9a in thread_start () from /lib64/libc.so.6

It may be worth looking at the backtraces of all threads, especially the
monitor thread (the main thread).

>
> I've also noticed that co->entry sometimes (maybe always?) points to
> mirror_run. Though, given that co->caller changes unexpectedly, I don't
> know whether we can trust co->entry.
>
> I do not see the bug when I perform the same migration without
> migrating the disk. I also do not see the bug when I remove the
> iothread from the guest.
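To illustrate, here is a minimal standalone sketch (hypothetical code, not
QEMU's; the Coroutine type and enter() below are made up) of the
check-then-act race on co->caller that produces exactly this abort when two
threads enter the same coroutine without synchronization:

    /* race.c - build with: gcc -pthread race.c -o race
     * Hypothetical sketch, not QEMU code: two threads repeatedly "enter"
     * and "leave" one shared coroutine object with no locking. Either
     * both pass the NULL check, or one observes the other's in-flight
     * enter and hits the abort. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Coroutine {
        struct Coroutine *caller;  /* non-NULL while entered */
    } Coroutine;

    static Coroutine co;           /* shared, unsynchronized */

    static void *enter(void *self)
    {
        for (int i = 0; i < 1000000; i++) {
            if (co.caller) {       /* the guard from qemu_coroutine_enter */
                fprintf(stderr, "Co-routine re-entered recursively\n");
                abort();
            }
            co.caller = self;      /* the other thread may be past the check */
            co.caller = NULL;      /* "yield": leave the coroutine again */
        }
        return NULL;
    }

    int main(void)
    {
        Coroutine c1, c2;          /* stand-ins for the two callers */
        pthread_t t1, t2;
        pthread_create(&t1, NULL, enter, &c1);
        pthread_create(&t2, NULL, enter, &c2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

If the mirror coroutine is being entered both from the main loop and from
the iothread's aio_poll, a race of this shape could explain the changing
co->caller values you observed.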
>
> I tested this scenario as far back as tag v2.4.0 and hit the bug every
> time. I was unable to test v2.3.0 due to unresolved guest hangs. I did,
> however, manage to get as far back as this commit:
>
>     commit ca96ac44dcd290566090b2435bc828fded356ad9
>     Author: Stefan Hajnoczi
>     Date:   Tue Jul 28 18:34:09 2015 +0200
>
>         AioContext: force event loop iteration using BH
>
> This commit fixes a hang that my test scenario experiences. I was able
> to test even further back by cherry-picking ca96ac44 on top of the
> earlier commits, but at that point I could not be sure whether the bug
> was introduced by ca96ac44 itself, so I stopped.
>
> I am willing to run tests or collect any info needed. I'll keep
> investigating, but I won't turn down any help :).
>
> Qemu command line, as taken from the libvirt log:
>
>     qemu-system-s390x \
>         -name kvm1 -S -machine s390-ccw-virtio-2.6,accel=kvm,usb=off \
>         -m 6144 -realtime mlock=off \
>         -smp 1,sockets=1,cores=1,threads=1 \
>         -object iothread,id=iothread1 \
>         -uuid 3796d9f0-8555-4a1e-9d5c-fac56b8cbf56 \
>         -nographic -no-user-config -nodefaults \
>         -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-kvm1/monitor.sock,server,nowait \
>         -mon chardev=charmonitor,id=monitor,mode=control \
>         -rtc base=utc -no-shutdown \
>         -boot strict=on -kernel /data/vms/kvm1/kvm1-image \
>         -initrd /data/vms/kvm1/kvm1-initrd -append 'hvc_iucv=8 TERM=dumb' \
>         -drive file=/dev/disk/by-path/ccw-0.0.c22b,format=raw,if=none,id=drive-virtio-disk0,cache=none \
>         -device virtio-blk-ccw,scsi=off,devno=fe.0.0000,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
>         -drive file=/data/vms/kvm1/kvm1.qcow,format=qcow2,if=none,id=drive-virtio-disk1,cache=none \
>         -device virtio-blk-ccw,iothread=iothread1,scsi=off,devno=fe.0.0008,drive=drive-virtio-disk1,id=virtio-disk1 \
>         -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 \
>         -device virtio-net-ccw,netdev=hostnet0,id=net0,mac=52:54:00:c9:86:2b,devno=fe.0.0001 \
>         -chardev pty,id=charconsole0 \
>         -device sclpconsole,chardev=charconsole0,id=console0 \
>         -device virtio-balloon-ccw,id=balloon0,devno=fe.0.0002 \
>         -msg timestamp=on
>
> Libvirt migration command:
>
>     virsh migrate --live --persistent --copy-storage-all \
>         --migrate-disks vdb kvm1 qemu+ssh://dev1/system
>
> --
> -- Jason J. Herne (jjherne@linux.vnet.ibm.com)