From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:42659)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <stefanha@gmail.com>) id 1YLxB5-0003Yz-M9
	for qemu-devel@nongnu.org; Thu, 12 Feb 2015 12:03:53 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <stefanha@gmail.com>) id 1YLxB2-0005p4-8n
	for qemu-devel@nongnu.org; Thu, 12 Feb 2015 12:03:47 -0500
Received: from mail-wg0-x230.google.com ([2a00:1450:400c:c00::230]:59474)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <stefanha@gmail.com>) id 1YLxB1-0005ou-Uk
	for qemu-devel@nongnu.org; Thu, 12 Feb 2015 12:03:44 -0500
Received: by mail-wg0-f48.google.com with SMTP id l18so8352612wgh.7
	for <qemu-devel@nongnu.org>; Thu, 12 Feb 2015 09:03:43 -0800 (PST)
Date: Thu, 12 Feb 2015 17:03:40 +0000
From: Stefan Hajnoczi <stefanha@gmail.com>
Message-ID: <20150212170340.GG4054@stefanha-thinkpad.redhat.com>
References: <1423552846-3896-1-git-send-email-wu.wubin@huawei.com>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1;
	protocol="application/pgp-signature"; boundary="NY6JkbSqL3W9mApi"
Content-Disposition: inline
In-Reply-To: <1423552846-3896-1-git-send-email-wu.wubin@huawei.com>
Subject: Re: [Qemu-devel] [PATCH v2] nbd: fix the co_queue multi-adding bug
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Bin Wu <wu.wubin@huawei.com>
Cc: kwolf@redhat.com, pbonzini@redhat.com, famz@redhat.com, qemu-devel@nongnu.org, stefanha@redhat.com


--NY6JkbSqL3W9mApi
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Tue, Feb 10, 2015 at 03:20:46PM +0800, Bin Wu wrote:
> From: Bin Wu <wu.wubin@huawei.com>
>
> When we tested VM migration between different hosts with NBD
> devices, we found that sending a cancel command right after
> drive_mirror had started caused a coroutine re-enter error. The stack
> was as follows:
>
> (gdb) bt
> 00)  0x00007fdfc744d885 in raise () from /lib64/libc.so.6
> 01)  0x00007fdfc744ee61 in abort () from /lib64/libc.so.6
> 02)  0x00007fdfca467cc5 in qemu_coroutine_enter (co=0x7fdfcaedb400, opaque=0x0)
> at qemu-coroutine.c:118
> 03)  0x00007fdfca467f6c in qemu_co_queue_run_restart (co=0x7fdfcaedb400) at
> qemu-coroutine-lock.c:59
> 04)  0x00007fdfca467be5 in coroutine_swap (from=0x7fdfcaf3c4e8,
> to=0x7fdfcaedb400) at qemu-coroutine.c:96
> 05)  0x00007fdfca467cea in qemu_coroutine_enter (co=0x7fdfcaedb400, opaque=0x0)
> at qemu-coroutine.c:123
> 06)  0x00007fdfca467f6c in qemu_co_queue_run_restart (co=0x7fdfcaedbdc0) at
> qemu-coroutine-lock.c:59
> 07)  0x00007fdfca467be5 in coroutine_swap (from=0x7fdfcaf3c4e8,
> to=0x7fdfcaedbdc0) at qemu-coroutine.c:96
> 08)  0x00007fdfca467cea in qemu_coroutine_enter (co=0x7fdfcaedbdc0, opaque=0x0)
> at qemu-coroutine.c:123
> 09)  0x00007fdfca4a1fa4 in nbd_recv_coroutines_enter_all (s=0x7fdfcaef7dd0) at
> block/nbd-client.c:41
> 10) 0x00007fdfca4a1ff9 in nbd_teardown_connection (client=0x7fdfcaef7dd0) at
> block/nbd-client.c:50
> 11) 0x00007fdfca4a20f0 in nbd_reply_ready (opaque=0x7fdfcaef7dd0) at
> block/nbd-client.c:92
> 12) 0x00007fdfca45ed80 in aio_dispatch (ctx=0x7fdfcae15e90) at aio-posix.c:144
> 13) 0x00007fdfca45ef1b in aio_poll (ctx=0x7fdfcae15e90, blocking=false) at
> aio-posix.c:222
> 14) 0x00007fdfca448c34 in aio_ctx_dispatch (source=0x7fdfcae15e90, callback=0x0,
> user_data=0x0) at async.c:212
> 15) 0x00007fdfc8f2f69a in g_main_context_dispatch () from
> /usr/lib64/libglib-2.0.so.0
> 16) 0x00007fdfca45c391 in glib_pollfds_poll () at main-loop.c:190
> 17) 0x00007fdfca45c489 in os_host_main_loop_wait (timeout=1483677098) at
> main-loop.c:235
> 18) 0x00007fdfca45c57b in main_loop_wait (nonblocking=0) at main-loop.c:484
> 19) 0x00007fdfca25f403 in main_loop () at vl.c:2249
> 20) 0x00007fdfca266fc2 in main (argc=42, argv=0x7ffff517d638,
> envp=0x7ffff517d790) at vl.c:4814
>
> We found that nbd_recv_coroutines_enter_all() (triggered by a cancel
> command or by the network connection breaking down) enters a coroutine
> that is still waiting for the send lock. If the lock is still held by
> another coroutine, the entered coroutine is added to the co_queue a
> second time. Later, when the lock is released, the coroutine re-enter
> error occurs.
>
> This bug can be fixed simply by delaying the setting of recv_coroutine,
> as suggested by Paolo. With this patch applied, we tested the cancel
> operation during the mirror phase in a loop for more than 5 hours and
> everything was fine. Without this patch, a coroutine re-enter error
> occurs within 5 minutes.
>
> Signed-off-by: Bin Wu <wu.wubin@huawei.com>
> ---
> v2: fix the coroutine re-enter bug in the NBD code, not in the coroutine
> infrastructure, as suggested by Paolo and Kevin.
> ---
>  block/nbd-client.c | 25 +++++++++++++------------
>  1 file changed, 13 insertions(+), 12 deletions(-)
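
For readers following the thread, the double-queueing described in the
commit message can be modelled with a small toy program. This is only a
sketch under stated assumptions: Waiter, Lock, lock_or_wait(), enter_all()
and simulate() are made-up names, not QEMU APIs, and the coroutine wait
queue is reduced to a counter. It also assumes that "delaying the setting
of recv_coroutine" means a request becomes visible to the teardown path
only after it has taken the send lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in QEMU. */

typedef struct {
    bool registered;   /* visible to enter_all(), like a recv_coroutine[] slot */
    bool waiting_lock; /* currently parked on the send lock's wait queue */
    int  queue_refs;   /* how many times this waiter sits on that queue */
} Waiter;

typedef struct {
    bool held;         /* true while some other requester owns the lock */
} Lock;

/* Take the lock if it is free; otherwise park the waiter on its queue. */
static void lock_or_wait(Lock *l, Waiter *w)
{
    if (l->held) {
        w->waiting_lock = true;
        w->queue_refs++;
    } else {
        l->held = true;
        w->waiting_lock = false;
    }
}

/*
 * Teardown path (cancel or connection loss): enter every registered waiter.
 * A waiter that was parked on the lock just retries the lock and, because
 * the lock is still held, is queued a second time. Registered waiters that
 * are not parked on the lock are outside the scope of this model.
 */
static void enter_all(Lock *l, Waiter *ws, int n)
{
    for (int i = 0; i < n; i++) {
        if (ws[i].registered && ws[i].waiting_lock) {
            lock_or_wait(l, &ws[i]);
        }
    }
}

/* Returns how many times the second requester ends up on the wait queue. */
int simulate(bool register_before_lock)
{
    Lock lock = { .held = true };   /* another requester holds the send lock */
    Waiter w = { false, false, 0 };

    if (register_before_lock) {
        w.registered = true;        /* old order: visible before locking */
    }
    lock_or_wait(&lock, &w);        /* lock is held, so w is queued once */
    if (!w.waiting_lock) {
        w.registered = true;        /* delayed order: register once locked */
    }
    enter_all(&lock, &w, 1);        /* cancel arrives while w still waits */
    return w.queue_refs;
}
```

In this model simulate(true) returns 2 (the waiter is on the queue twice,
which is the precondition for the re-enter abort in the backtrace), while
simulate(false) returns 1 because the teardown path never sees a waiter
that has not yet acquired the lock.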

Thanks, applied to my block tree:
https://github.com/stefanha/qemu/commits/block

Stefan

--NY6JkbSqL3W9mApi
Content-Type: application/pgp-signature

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJU3NzsAAoJEJykq7OBq3PIFw8IAKfD9Y4oVI5I119EDSkcQ/nD
4OqdUZCB7pzET03XrnjOj6fAlOvU4IWORJGPE29L3mhpcFZqYPr1DKA9BBinsbyD
rr84cbJYSXHMm2mWB44L/dqOU811YwWEvg5xYNQdpiMRtjpFxO68yq4zGhwF1jBE
82VpUBmDI1YgfqF+5KVuNNtBt/YrMGOwrYV084kMlhZg0Us7hiRTdRFU4AIb4B/F
VHB0t3pAORcabGzn/I6snsRCtrx9GxdIFTfSgoThBu5fLfuRHtzUjiYYyQLrilu5
QADdkGwyxhNY6Nhwb+/IDkex5eg+1EMoJ5t01ut64rbqlUPQ8nVdTTJljEyZc7o=
=HM32
-----END PGP SIGNATURE-----

--NY6JkbSqL3W9mApi--