Date: Tue, 24 May 2016 10:12:44 +0800
From: Fam Zheng
To: "Jason J. Herne"
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org, stefanha@redhat.com,
    jcody@redhat.com, quintela@redhat.com
Subject: Re: [Qemu-devel] coroutines: block: Co-routine re-entered recursively
    when migrating disk with iothreads
Message-ID: <20160524021244.GD14601@ad.usersys.redhat.com>
In-Reply-To: <574351CE.8000605@linux.vnet.ibm.com>
References: <574351CE.8000605@linux.vnet.ibm.com>

On Mon, 05/23 14:54, Jason J. Herne wrote:
> Using libvirt to migrate a guest and one guest disk that is using
> iothreads causes Qemu to crash with the message:
>
>     Co-routine re-entered recursively
>
> I've looked into this one a bit, but I have not seen anything that
> immediately stands out. Here is what I have found.
>
> In qemu_coroutine_enter:
>
>     if (co->caller) {
>         fprintf(stderr, "Co-routine re-entered recursively\n");
>         abort();
>     }
>
> The value of co->caller is actually changing between the time "if
> (co->caller)" is evaluated and the time I print some debug statements
> directly under the existing fprintf. I confirmed this by saving the
> value in a local variable and printing both the new local variable and
> co->caller immediately after the existing fprintf. This certainly
> indicates some kind of concurrency issue. However, it does not
> necessarily point to the reason we ended up inside this if statement,
> because co->caller was not NULL before it was trashed. Perhaps it was
> trashed more than once, then? I figured maybe the problem was with
> coroutine pools, so I disabled them (--disable-coroutine-pool) and
> still hit the bug.

Which coroutine backend are you using?

>
> The backtrace is not always identical. Here is one instance:
>
> (gdb) bt
> #0  0x000003ffa78be2c0 in raise () from /lib64/libc.so.6
> #1  0x000003ffa78bfc26 in abort () from /lib64/libc.so.6
> #2  0x0000000080427d80 in qemu_coroutine_enter (co=0xa2cf2b40, opaque=0x0)
>     at /root/kvmdev/qemu/util/qemu-coroutine.c:112
> #3  0x000000008032246e in nbd_restart_write (opaque=0xa2d0cd40)
>     at /root/kvmdev/qemu/block/nbd-client.c:114
> #4  0x00000000802b3a1c in aio_dispatch (ctx=0xa2c907a0)
>     at /root/kvmdev/qemu/aio-posix.c:341
> #5  0x00000000802b4332 in aio_poll (ctx=0xa2c907a0, blocking=true)
>     at /root/kvmdev/qemu/aio-posix.c:479
> #6  0x0000000080155aba in iothread_run (opaque=0xa2c90260)
>     at /root/kvmdev/qemu/iothread.c:46
> #7  0x000003ffa7a87c2c in start_thread () from /lib64/libpthread.so.0
> #8  0x000003ffa798ec9a in thread_start () from /lib64/libc.so.6

It may be worth looking at the backtraces of all threads, especially the
monitor thread (the main thread).

>
> I've also noticed that co->entry sometimes (maybe always?) points to
> mirror_run. Though, given that co->caller changes unexpectedly, I don't
> know whether we can trust co->entry.
>
> I do not see the bug when I perform the same migration without
> migrating the disk. I also do not see the bug when I remove the
> iothread from the guest.
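To illustrate, here is a minimal standalone sketch (hypothetical code, not
QEMU's; the Coroutine type and enter() below are made up) of the
check-then-act race on co->caller that produces exactly this abort when two
threads enter the same coroutine without synchronization:

    /* race.c - build with: gcc -pthread race.c -o race
     * Hypothetical sketch, not QEMU code: two threads repeatedly "enter"
     * and "leave" one shared coroutine object with no locking. Either
     * both pass the NULL check, or one observes the other's in-flight
     * enter and hits the abort. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct Coroutine {
        struct Coroutine *caller;  /* non-NULL while entered */
    } Coroutine;

    static Coroutine co;           /* shared, unsynchronized */

    static void *enter(void *self)
    {
        for (int i = 0; i < 1000000; i++) {
            if (co.caller) {       /* the guard from qemu_coroutine_enter */
                fprintf(stderr, "Co-routine re-entered recursively\n");
                abort();
            }
            co.caller = self;      /* the other thread may be past the check */
            co.caller = NULL;      /* "yield": leave the coroutine again */
        }
        return NULL;
    }

    int main(void)
    {
        Coroutine c1, c2;          /* stand-ins for the two callers */
        pthread_t t1, t2;
        pthread_create(&t1, NULL, enter, &c1);
        pthread_create(&t2, NULL, enter, &c2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

If the mirror coroutine is being entered both from the main loop and from
the iothread's aio_poll, a race of this shape could explain the changing
co->caller values you observed.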
>
> I tested this scenario as far back as tag v2.4.0 and hit the bug every
> time. I was unable to test v2.3.0 due to unresolved guest hangs. I did,
> however, manage to get as far back as this commit:
>
>     commit ca96ac44dcd290566090b2435bc828fded356ad9
>     Author: Stefan Hajnoczi
>     Date:   Tue Jul 28 18:34:09 2015 +0200
>
>         AioContext: force event loop iteration using BH
>
> This commit fixes a hang that my test scenario experiences. I was able
> to test even further back by cherry-picking ca96ac44 on top of the
> earlier commits, but at that point I could not be sure whether the bug
> was introduced by ca96ac44 itself, so I stopped.
>
> I am willing to run tests or collect any info needed. I'll keep
> investigating, but I won't turn down any help :).
>
> Qemu command line, as taken from the libvirt log:
>
>     qemu-system-s390x \
>         -name kvm1 -S -machine s390-ccw-virtio-2.6,accel=kvm,usb=off \
>         -m 6144 -realtime mlock=off \
>         -smp 1,sockets=1,cores=1,threads=1 \
>         -object iothread,id=iothread1 \
>         -uuid 3796d9f0-8555-4a1e-9d5c-fac56b8cbf56 \
>         -nographic -no-user-config -nodefaults \
>         -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-kvm1/monitor.sock,server,nowait \
>         -mon chardev=charmonitor,id=monitor,mode=control \
>         -rtc base=utc -no-shutdown \
>         -boot strict=on -kernel /data/vms/kvm1/kvm1-image \
>         -initrd /data/vms/kvm1/kvm1-initrd -append 'hvc_iucv=8 TERM=dumb' \
>         -drive file=/dev/disk/by-path/ccw-0.0.c22b,format=raw,if=none,id=drive-virtio-disk0,cache=none \
>         -device virtio-blk-ccw,scsi=off,devno=fe.0.0000,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
>         -drive file=/data/vms/kvm1/kvm1.qcow,format=qcow2,if=none,id=drive-virtio-disk1,cache=none \
>         -device virtio-blk-ccw,iothread=iothread1,scsi=off,devno=fe.0.0008,drive=drive-virtio-disk1,id=virtio-disk1 \
>         -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 \
>         -device virtio-net-ccw,netdev=hostnet0,id=net0,mac=52:54:00:c9:86:2b,devno=fe.0.0001 \
>         -chardev pty,id=charconsole0 \
>         -device sclpconsole,chardev=charconsole0,id=console0 \
>         -device virtio-balloon-ccw,id=balloon0,devno=fe.0.0002 \
>         -msg timestamp=on
>
> Libvirt migration command:
>
>     virsh migrate --live --persistent --copy-storage-all \
>         --migrate-disks vdb kvm1 qemu+ssh://dev1/system
>
> --
> -- Jason J. Herne (jjherne@linux.vnet.ibm.com)