From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:48028) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YL7oy-0000il-Be for qemu-devel@nongnu.org; Tue, 10 Feb 2015 05:13:33 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YL7ou-0004sk-7D for qemu-devel@nongnu.org; Tue, 10 Feb 2015 05:13:32 -0500 Received: from mx1.redhat.com ([209.132.183.28]:52225) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YL7ot-0004sW-Vi for qemu-devel@nongnu.org; Tue, 10 Feb 2015 05:13:28 -0500 Date: Tue, 10 Feb 2015 11:13:18 +0100 From: Kevin Wolf Message-ID: <20150210101318.GB5202@noname.str.redhat.com> References: <1423464639-4764-1-git-send-email-wu.wubin@huawei.com> <20150209144813.GA2076@stefanha-thinkpad.redhat.com> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="u3/rZRmxL6MmkK24" Content-Disposition: inline In-Reply-To: <20150209144813.GA2076@stefanha-thinkpad.redhat.com> Subject: Re: [Qemu-devel] [PATCH v2] qemu-coroutine: segfault when restarting co_queue List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Stefan Hajnoczi Cc: famz@redhat.com, stefanha@redhat.com, subo7@huawei.com, kathy.wangting@huawei.com, bruce.fon@huawei.com, qemu-devel@nongnu.org, arei.gonglei@huawei.com, boby.chen@huawei.com, pbonzini@redhat.com, rudy.zhangmin@huawei.com, Bin Wu --u3/rZRmxL6MmkK24 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Am 09.02.2015 um 15:48 hat Stefan Hajnoczi geschrieben: > On Mon, Feb 09, 2015 at 02:50:39PM +0800, Bin Wu wrote: > > From: Bin Wu > >=20 > > We tested VMs migration with their disk images by drive_mirror. With > > migration, two VMs copyed large files between each other. During the > > test, a segfault occured. The stack was as follow: > >=20 > > (gdb) bt > > qemu-coroutine-lock.c:66 > > to=3D0x7fa5a1798648) at qemu-coroutine.c:97 > > request=3D0x7fa28c2ffa10, reply=3D0x7fa28c2ffa30, qiov=3D0x0, offset=3D= 0) at > > block/nbd-client.c:165 > > sector_num=3D8552704, nb_sectors=3D2040, qiov=3D0x7fa5a1757468, offset= =3D0) at > > block/nbd-client.c:262 > > sector_num=3D8552704, nb_sectors=3D2048, qiov=3D0x7fa5a1757468) at > > block/nbd-client.c:296 > > nb_sectors=3D2048, qiov=3D0x7fa5a1757468) at block/nbd.c:291 > > req=3D0x7fa28c2ffbb0, offset=3D4378984448, bytes=3D1048576, qiov=3D0x7f= a5a1757468, > > flags=3D0) at block.c:3321 > > offset=3D4378984448, bytes=3D1048576, qiov=3D0x7fa5a1757468, flags=3D(u= nknown: 0)) at > > block.c:3447 > > sector_num=3D8552704, nb_sectors=3D2048, qiov=3D0x7fa5a1757468, flags= =3D(unknown: 0)) at > > block.c:3471 > > nb_sectors=3D2048, qiov=3D0x7fa5a1757468) at block.c:3480 > > nb_sectors=3D2048, qiov=3D0x7fa5a1757468) at block/raw_bsd.c:62 > > req=3D0x7fa28c2ffe30, offset=3D4378984448, bytes=3D1048576, qiov=3D0x7f= a5a1757468, > > flags=3D0) at block.c:3321 > > offset=3D4378984448, bytes=3D1048576, qiov=3D0x7fa5a1757468, flags=3D(u= nknown: 0)) at > > block.c:3447 > > sector_num=3D8552704, nb_sectors=3D2048, qiov=3D0x7fa5a1757468, flags= =3D(unknown: 0)) at > > block.c:3471 > > coroutine-ucontext.c:121 >=20 > This backtrace is incomplete. Where are the function names? The > parameter lists appear incomplete too. >=20 > > After analyzing the stack and reviewing the code, we find the > > qemu_co_queue_run_restart should not be put in the coroutine_swap funct= ion which > > can be invoked by qemu_coroutine_enter or qemu_coroutine_yield. Only > > qemu_coroutine_enter needs to restart the co_queue. > >=20 > > The error scenario is as follow: coroutine C1 enters C2, C2 yields > > back to C1, then C1 ternimates and the related coroutine memory > > becomes invalid. After a while, the C2 coroutine is entered again. > > At this point, C1 is used as a parameter passed to > > qemu_co_queue_run_restart. Therefore, qemu_co_queue_run_restart > > accesses an invalid memory and a segfault error ocurrs. > >=20 > > The qemu_co_queue_run_restart function re-enters coroutines waiting > > in the co_queue. However, this function should be only used int the > > qemu_coroutine_enter context. Only in this context, when the current > > coroutine gets execution control again(after the execution of > > qemu_coroutine_switch), we can restart the target coutine because the > > target coutine has yielded back to the current coroutine or it has > > terminated. > >=20 > > First we want to put qemu_co_queue_run_restart in qemu_coroutine_enter, > > but we find we can not access the target coroutine if it terminates. >=20 > This example captures the scenario you describe: >=20 > diff --git a/qemu-coroutine.c b/qemu-coroutine.c > index 525247b..883cbf5 100644 > --- a/qemu-coroutine.c > +++ b/qemu-coroutine.c > @@ -103,7 +103,10 @@ static void coroutine_swap(Coroutine *from, Coroutin= e *to) > { > CoroutineAction ret; > =20 > + fprintf(stderr, "> %s from %p to %p\n", __func__, from, to); > ret =3D qemu_coroutine_switch(from, to, COROUTINE_YIELD); > + fprintf(stderr, "< %s from %p to %p switch %s\n", __func__, from, to, > + ret =3D=3D COROUTINE_YIELD ? "yield" : "terminate"); > =20 > qemu_co_queue_run_restart(to); > =20 > @@ -111,6 +114,7 @@ static void coroutine_swap(Coroutine *from, Coroutine= *to) > case COROUTINE_YIELD: > return; > case COROUTINE_TERMINATE: > + fprintf(stderr, "coroutine_delete %p\n", to); > trace_qemu_coroutine_terminate(to); > coroutine_delete(to); > return; > diff --git a/tests/test-coroutine.c b/tests/test-coroutine.c > index 27d1b6f..d44c428 100644 > --- a/tests/test-coroutine.c > +++ b/tests/test-coroutine.c > @@ -13,6 +13,7 @@ > =20 > #include > #include "block/coroutine.h" > +#include "block/coroutine_int.h" > =20 > /* > * Check that qemu_in_coroutine() works > @@ -122,6 +123,35 @@ static void test_yield(void) > g_assert_cmpint(i, =3D=3D, 5); /* coroutine must yield 5 times */ > } > =20 > +static void coroutine_fn c2_fn(void *opaque) > +{ > + fprintf(stderr, "c2 Part 1\n"); > + qemu_coroutine_yield(); > + fprintf(stderr, "c2 Part 2\n"); > +} > + > +static void coroutine_fn c1_fn(void *opaque) > +{ > + Coroutine *c2 =3D opaque; > + > + fprintf(stderr, "c1 Part 1\n"); > + qemu_coroutine_enter(c2, NULL); > + fprintf(stderr, "c1 Part 2\n"); > +} > + > +static void test_co_queue(void) > +{ > + Coroutine *c1; > + Coroutine *c2; > + > + c1 =3D qemu_coroutine_create(c1_fn); > + c2 =3D qemu_coroutine_create(c2_fn); > + > + qemu_coroutine_enter(c1, c2); > + memset(c1, 0xff, sizeof(Coroutine)); > + qemu_coroutine_enter(c2, NULL); > +} > + > /* > * Check that creation, enter, and return work > */ > @@ -343,6 +373,7 @@ static void perf_cost(void) > int main(int argc, char **argv) > { > g_test_init(&argc, &argv, NULL); > + g_test_add_func("/basic/co_queue", test_co_queue); > g_test_add_func("/basic/lifecycle", test_lifecycle); > g_test_add_func("/basic/yield", test_yield); > g_test_add_func("/basic/nesting", test_nesting); >=20 > Here is the output (with printfs in coroutine_swap): >=20 > -> coroutine_swap from MAIN to C1 > c1 Part 1 > -> coroutine_swap from C1 to C2 > c2 Part 1 > -> coroutine_swap from C2 to C1 > <- coroutine_swap from C1 to C2 switch yield > c1 Part 2 > <- coroutine_swap from MAIN to C1 switch terminate > coroutine_delete C1 > -> coroutine_swap from MAIN to C2 > <- coroutine_swap from C2 to C1 switch yield !!! > c2 Part 2 > <- coroutine_swap from MAIN to C2 switch terminate > coroutine_delete C2 >=20 > I have marked the problematic line with "!!!". The to=3DC1 variable is > used after C1 has been deleted. >=20 > The test crashes since it writes 0xff to C1 after it has terminated. >=20 > > Signed-off-by: Bin Wu > > --- > > qemu-coroutine.c | 16 ++++++++++------ > > 1 file changed, 10 insertions(+), 6 deletions(-) > >=20 > > diff --git a/qemu-coroutine.c b/qemu-coroutine.c > > index 525247b..cc0bdfa 100644 > > --- a/qemu-coroutine.c > > +++ b/qemu-coroutine.c > > @@ -99,29 +99,31 @@ static void coroutine_delete(Coroutine *co) > > qemu_coroutine_delete(co); > > } > > =20 > > -static void coroutine_swap(Coroutine *from, Coroutine *to) > > +static CoroutineAction coroutine_swap(Coroutine *from, Coroutine *to) > > { > > CoroutineAction ret; > > =20 > > ret =3D qemu_coroutine_switch(from, to, COROUTINE_YIELD); > > =20 > > - qemu_co_queue_run_restart(to); > > - > > switch (ret) { > > case COROUTINE_YIELD: > > - return; > > + break; > > case COROUTINE_TERMINATE: > > trace_qemu_coroutine_terminate(to); > > + qemu_co_queue_run_restart(to); > > coroutine_delete(to); > > - return; > > + break; > > default: > > abort(); > > } > > + > > + return ret; > > } > > =20 > > void qemu_coroutine_enter(Coroutine *co, void *opaque) > > { > > Coroutine *self =3D qemu_coroutine_self(); > > + CoroutineAction ret; > > =20 > > trace_qemu_coroutine_enter(self, co, opaque); > > =20 > > @@ -132,7 +134,9 @@ void qemu_coroutine_enter(Coroutine *co, void *opaq= ue) > > =20 > > co->caller =3D self; > > co->entry_arg =3D opaque; > > - coroutine_swap(self, co); > > + ret =3D coroutine_swap(self, co); > > + if (ret =3D=3D COROUTINE_YIELD) > > + qemu_co_queue_run_restart(co); > > } >=20 > Your fix looks correct although QEMU coding style requires {}. >=20 > I tried to think of a simpler solution that keeps a single > qemu_co_queue_run_restart() call but was unable to find one. This one-liner is the core of the real fix: --- a/qemu-coroutine.c +++ b/qemu-coroutine.c @@ -148,5 +148,5 @@ void coroutine_fn qemu_coroutine_yield(void) } =20 self->caller =3D NULL; - coroutine_swap(self, to); + qemu_coroutine_switch(self, to, COROUTINE_YIELD); } There is no reason at all for qemu_coroutine_yield() to go through coroutine_swap(). The cleanups that can be made on top is to inline coroutine_swap() in qemu_coroutine_enter(), and to change its use of COROUTINE_YIELD into a new COROUTINE_ENTER (which is never looked at anyway, but we're not yielding here). All the other solutions take a unnecessarily complicated design as granted and further complicate it to distinguish two unrelated code paths. Kevin --u3/rZRmxL6MmkK24 Content-Type: application/pgp-signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIcBAEBAgAGBQJU2dm+AAoJEH8JsnLIjy/WGp4P/RKiUYqmfqogiWuGVqe/UsP1 hCPsFiH6YBwxF+VR2fhApyyLfHaKIEOa3ebxNTbfllN9ZlGefvWgwJdElbjKmfjJ MmzP+ZulnpTnsgpchdSvTpFr+mhtRsM7lMoqBhU85BeTkG+XrElamYMlMa+2YFTb IBVSJlVkrCDT1IEg6Q8QbfEZDc3fduJFcODHlejV4CC4cgN8dt0paNJA2TgLzmaR WJgyCM21sOEm6BRB2HAYhxAW1uiQfdfRcHMaFnw0Xcxe5n5Q17SSH2Rw6VKcHT+A lyEhXKqblKOCz0z05J1fY5TMBOcCjdNRaby+2n7kAJAr7V+9aKMpE6Z8iu7pFT3B RtB5yMbp8YPVSEYvvAFzyj6VypGckWxcR0Dwp40BEPtOp8STjuI5YGsOE6qh8SST At+sqPa2bIJsr70xjF3X8vTMj12fBQggzytpO/qOcJRICoEqUSKaJyQSSAIUa2gT JM7V2pglgHu8v18BQ1TJsDDEUnY7NOMkYYc0sgOT5roetzUiM98bfuxNCQkfXpel Xx8YmgUjlrhZM4sV9b8oYXwFpbYr+dxS7Rf1AexxKz1QuiXfvTwHIEBbMGh+7C9g hr9xi9YGZav/QLUhPjMxrEsKu4dJI8i654xVLFClg5UdX3XyiwA7rW3w/VngxeJP Ce4AdUXWjb52BWO2c2bm =0YUf -----END PGP SIGNATURE----- --u3/rZRmxL6MmkK24--