From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 12 Feb 2015 17:03:40 +0000
From: Stefan Hajnoczi
Message-ID: <20150212170340.GG4054@stefanha-thinkpad.redhat.com>
References: <1423552846-3896-1-git-send-email-wu.wubin@huawei.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="NY6JkbSqL3W9mApi"
Content-Disposition: inline
In-Reply-To: <1423552846-3896-1-git-send-email-wu.wubin@huawei.com>
Subject: Re: [Qemu-devel] [PATCH v2] nbd: fix the co_queue multi-adding bug
To: Bin Wu
Cc: kwolf@redhat.com, pbonzini@redhat.com, famz@redhat.com, qemu-devel@nongnu.org, stefanha@redhat.com

--NY6JkbSqL3W9mApi
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Tue, Feb 10, 2015 at 03:20:46PM +0800, Bin Wu wrote:
> From: Bin Wu
>
> When we tested VM migration between different hosts with NBD devices,
> we found that if we sent a cancel command just after drive_mirror had
> started, a coroutine re-enter error would occur.
> The stack was as follows:
>
> (gdb) bt
> 00) 0x00007fdfc744d885 in raise () from /lib64/libc.so.6
> 01) 0x00007fdfc744ee61 in abort () from /lib64/libc.so.6
> 02) 0x00007fdfca467cc5 in qemu_coroutine_enter (co=0x7fdfcaedb400, opaque=0x0)
> at qemu-coroutine.c:118
> 03) 0x00007fdfca467f6c in qemu_co_queue_run_restart (co=0x7fdfcaedb400) at
> qemu-coroutine-lock.c:59
> 04) 0x00007fdfca467be5 in coroutine_swap (from=0x7fdfcaf3c4e8,
> to=0x7fdfcaedb400) at qemu-coroutine.c:96
> 05) 0x00007fdfca467cea in qemu_coroutine_enter (co=0x7fdfcaedb400, opaque=0x0)
> at qemu-coroutine.c:123
> 06) 0x00007fdfca467f6c in qemu_co_queue_run_restart (co=0x7fdfcaedbdc0) at
> qemu-coroutine-lock.c:59
> 07) 0x00007fdfca467be5 in coroutine_swap (from=0x7fdfcaf3c4e8,
> to=0x7fdfcaedbdc0) at qemu-coroutine.c:96
> 08) 0x00007fdfca467cea in qemu_coroutine_enter (co=0x7fdfcaedbdc0, opaque=0x0)
> at qemu-coroutine.c:123
> 09) 0x00007fdfca4a1fa4 in nbd_recv_coroutines_enter_all (s=0x7fdfcaef7dd0) at
> block/nbd-client.c:41
> 10) 0x00007fdfca4a1ff9 in nbd_teardown_connection (client=0x7fdfcaef7dd0) at
> block/nbd-client.c:50
> 11) 0x00007fdfca4a20f0 in nbd_reply_ready (opaque=0x7fdfcaef7dd0) at
> block/nbd-client.c:92
> 12) 0x00007fdfca45ed80 in aio_dispatch (ctx=0x7fdfcae15e90) at aio-posix.c:144
> 13) 0x00007fdfca45ef1b in aio_poll (ctx=0x7fdfcae15e90, blocking=false) at
> aio-posix.c:222
> 14) 0x00007fdfca448c34 in aio_ctx_dispatch (source=0x7fdfcae15e90, callback=0x0,
> user_data=0x0) at async.c:212
> 15) 0x00007fdfc8f2f69a in g_main_context_dispatch () from
> /usr/lib64/libglib-2.0.so.0
> 16) 0x00007fdfca45c391 in glib_pollfds_poll () at main-loop.c:190
> 17) 0x00007fdfca45c489 in os_host_main_loop_wait (timeout=1483677098) at
> main-loop.c:235
> 18) 0x00007fdfca45c57b in main_loop_wait (nonblocking=0) at main-loop.c:484
> 19) 0x00007fdfca25f403 in main_loop () at vl.c:2249
> 20) 0x00007fdfca266fc2 in main (argc=42,
> argv=0x7ffff517d638, envp=0x7ffff517d790) at vl.c:4814
>
> We found that the nbd_recv_coroutines_enter_all function (triggered by a
> cancel command or a network connection breaking down) will enter a
> coroutine which is waiting for the sending lock. If the lock is still held
> by another coroutine, the entering coroutine will be added into the
> co_queue again. Later, when the lock is released, a coroutine re-enter
> error will occur.
>
> This bug can be fixed simply by delaying the setting of recv_coroutine, as
> suggested by Paolo. After applying this patch, we tested the cancel
> operation in the mirror phase in a loop for more than 5 hours and
> everything was fine. Without this patch, a coroutine re-enter error
> occurs within 5 minutes.
>
> Signed-off-by: Bin Wu
> ---
> v2: fix the coroutine re-enter bug in NBD code, not in coroutine
> infrastructure, as suggested by Paolo and Kevin.
> ---
>  block/nbd-client.c | 25 +++++++++++++------------
>  1 file changed, 13 insertions(+), 12 deletions(-)

Thanks, applied to my block tree:
https://github.com/stefanha/qemu/commits/block

Stefan

--NY6JkbSqL3W9mApi--