From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 16 Jul 2015 11:17:35 +0100
From: Stefan Hajnoczi
Message-ID: <20150716101735.GD3963@stefanha-thinkpad.redhat.com>
References: <1436980403-25898-1-git-send-email-pbonzini@redhat.com>
In-Reply-To: <1436980403-25898-1-git-send-email-pbonzini@redhat.com>
Subject: Re: [Qemu-devel] [PATCH] AioContext: fix broken ctx->dispatching optimization
To: Paolo Bonzini
Cc: kwolf@redhat.com, lersek@redhat.com, qemu-devel@nongnu.org, rjones@redhat.com

On Wed, Jul 15, 2015 at 07:13:23PM +0200, Paolo Bonzini wrote:
> This patch rewrites the ctx->dispatching optimization, which was the cause
> of some mysterious hangs that could be reproduced on aarch64 KVM only.
> The hangs were indirectly caused by aio_poll() and in particular by
> flash memory updates' call to blk_write(), which invokes aio_poll().
> Fun stuff: they had an extremely short race window, so much so that
> adding any kind of tracing to either the kernel or QEMU made it
> go away (a single printf made it half as reproducible).
>
> On the plus side, the failure mode (a hang until the next keypress)
> made it very easy to examine the state of the process with a debugger.
> And there was a very nice reproducer from Laszlo, which failed pretty
> often (more than half of the time) on any version of QEMU with a non-debug
> kernel; it also failed fast, while still in the firmware. So, it could
> have been worse.
>
> For some unknown reason the hangs happened only with virtio-scsi, but
> that's not important. It's more interesting that they disappeared with
> io=native, making thread-pool.c a likely suspect for where the bug arose.
> thread-pool.c is also one of the few places which use bottom halves
> across threads, by the way.
>
> I hope that no other similar bugs exist, but just in case :) I am
> going to describe how the successful debugging went... Since the
> likely culprit was the ctx->dispatching optimization, which mostly
> affects bottom halves, the first observation was that there are two
> qemu_bh_schedule() invocations in the thread pool: the one in the aio
> worker and the one in thread_pool_completion_bh. The latter always
> causes the optimization to trigger, the former may or may not. In
> order to restrict the possibilities, I introduced new functions
> qemu_bh_schedule_slow() and qemu_bh_schedule_fast():
>
>     /* qemu_bh_schedule_slow: */
>     ctx = bh->ctx;
>     bh->idle = 0;
>     if (atomic_xchg(&bh->scheduled, 1) == 0) {
>         event_notifier_set(&ctx->notifier);
>     }
>
>     /* qemu_bh_schedule_fast: */
>     ctx = bh->ctx;
>     bh->idle = 0;
>     assert(ctx->dispatching);
>     atomic_xchg(&bh->scheduled, 1);
>
> Notice how the atomic_xchg is still in qemu_bh_schedule_slow(). This
> was already debated a few months ago, so I assumed it to be correct.
> In retrospect this was a very good idea, as you'll see later.
>
> Changing thread_pool_completion_bh() to qemu_bh_schedule_fast() didn't
> trigger the assertion (as expected). Changing the worker's invocation
> to qemu_bh_schedule_slow() didn't hide the bug (another assumption
> which luckily held). This already heavily limited the amount of
> interaction between the threads, hinting that the problematic events
> must have triggered around thread_pool_completion_bh().
>
> As mentioned earlier, invoking a debugger to examine the state of a
> hung process was pretty easy; the iothread was always waiting on a
> poll(..., -1) system call. Infinite timeouts are much rarer on x86,
> and this could be the reason why the bug was never observed there.
> With the buggy sequence more or less resolved to an interaction between
> thread_pool_completion_bh() and poll(..., -1), my "tracing" strategy was
> to just add a few qemu_clock_get_ns(QEMU_CLOCK_REALTIME) calls, hoping
> that the ordering of aio_ctx_prepare(), aio_ctx_dispatch(), poll() and
> qemu_bh_schedule_fast() would provide some hint. The output was:
>
>     (gdb) p last_prepare
>     $3 = 103885451
>     (gdb) p last_dispatch
>     $4 = 103876492
>     (gdb) p last_poll
>     $5 = 115909333
>     (gdb) p last_schedule
>     $6 = 115925212
>
> Notice how the last call to qemu_poll_ns() came after aio_ctx_dispatch().
> This makes little sense unless there is an aio_poll() call involved,
> and indeed with a slightly different instrumentation you can see that
> there is one:
>
>     (gdb) p last_prepare
>     $3 = 107569679
>     (gdb) p last_dispatch
>     $4 = 107561600
>     (gdb) p last_aio_poll
>     $5 = 110671400
>     (gdb) p last_schedule
>     $6 = 110698917
>
> So the scenario becomes clearer:
>
>     iothread                          VCPU thread
>  --------------------------------------------------------------------------
>     aio_ctx_prepare
>     aio_ctx_check
>     qemu_poll_ns(timeout=-1)
>                                       aio_poll
>                                       aio_dispatch
>                                       thread_pool_completion_bh
>                                       qemu_bh_schedule()
>
> At this point bh->scheduled = 1 and the iothread has not been woken up.
> The solution must be close, but this alone should not be a problem,
> because the bottom half is only rescheduled to account for rare situations
> (see commit 3c80ca1, thread-pool: avoid deadlock in nested aio_poll()
> calls, 2014-07-15).
>
> Introducing a third thread---a thread pool worker thread, which
> also does qemu_bh_schedule()---does bring out the problematic case.
> The third thread must be awakened *after* the callback is complete and
> thread_pool_completion_bh has redone the whole loop, explaining the
> short race window. And then this is what happens:
>
>     thread pool worker
>  --------------------------------------------------------------------------
>     <I/O completes>
>     qemu_bh_schedule()
>
> Tada, bh->scheduled is already 1, so qemu_bh_schedule() does nothing
> and the iothread is never woken up. This is where the bh->scheduled
> optimization comes into play---it is correct, but removing it would
> have masked the bug.
>
> So, what is the bug?
>
> Well, the question asked by the ctx->dispatching optimization ("is any
> active aio_poll dispatching?") was wrong. The right question to ask
> instead is "is any active aio_poll *not* dispatching?", i.e. in the prepare
> or poll phases?
> In that case, the aio_poll is sleeping or might go to
> sleep anytime soon, and the EventNotifier must be invoked to wake
> it up.
>
> In any other case (including if there is *no* active aio_poll at all!)
> we can just wait for the next prepare phase to pick up the event (e.g. a
> bottom half); the prepare phase will avoid the blocking and service the
> bottom half.
>
> Expressing the invariant with a logic formula, the broken one looked like:
>
>     !(exists(thread): in_dispatching(thread)) => !optimize
>
> or equivalently:
>
>     !(exists(thread):
>       in_aio_poll(thread) && in_dispatching(thread)) => !optimize
>
> In the correct one, the negation is in a slightly different place:
>
>     (exists(thread):
>       in_aio_poll(thread) && !in_dispatching(thread)) => !optimize
>
> or equivalently:
>
>     (exists(thread): in_prepare_or_poll(thread)) => !optimize
>
> Even if the difference boils down to moving an exclamation mark :)
> the implementation is quite different. However, I think the new
> one is simpler to understand.
>
> In the old implementation, the "exists" was implemented with a boolean
> value. This didn't really support the case of multiple concurrent
> event loops well, but I thought that this was okay: aio_poll holds the
> AioContext lock so there cannot be concurrent aio_poll invocations, and
> I was just considering nested event loops. However, aio_poll _could_
> indeed be concurrent with the GSource. This is why I came up with the
> wrong invariant.
>
> In the new implementation, "exists" is computed simply by counting how many
> threads are in the prepare or poll phases. There are some interesting
> points to consider, but the gist of the idea remains:
>
> 1) AioContext can be used through GSource as well; as mentioned in the
> patch, bit 0 of the counter is reserved for the GSource.
>
> 2) the counter need not be updated for a non-blocking aio_poll, because
> it won't sleep forever anyway.
> This is just a matter of checking
> the "blocking" variable. This requires some changes to the win32
> implementation, but is otherwise not too complicated.
>
> 3) as mentioned above, the new implementation will not call aio_notify
> when there is *no* active aio_poll at all. The tests have to be
> adjusted for this change. The calls to aio_notify in async.c are fine;
> they only want to kick aio_poll out of a blocking wait, but need not
> do anything if aio_poll is not running.
>
> 4) nested aio_poll: these just work with the new implementation; when
> a nested event loop is invoked, the outer event loop is never in the
> prepare or poll phases. The outer event loop has thus already decremented
> the counter.
>
> Reported-by: Richard W. M. Jones
> Reported-by: Laszlo Ersek
> Signed-off-by: Paolo Bonzini
> ---
>  aio-posix.c             | 19 ++++++------
>  aio-win32.c             | 33 ++++++++++-----------
>  async.c                 | 21 +++++---------
>  docs/aio_notify.promela | 77 +++++++++++++++++++++----------------------
>  include/block/aio.h     | 29 +++++++++++++++----
>  tests/test-aio.c        | 26 +++--------------
>  6 files changed, 91 insertions(+), 114 deletions(-)

Looks good for QEMU 2.4:

Reviewed-by: Stefan Hajnoczi

I'll merge it for the next -rc unless there is any more discussion.