From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 16 Jul 2015 11:17:35 +0100
From: Stefan Hajnoczi
Message-ID: <20150716101735.GD3963@stefanha-thinkpad.redhat.com>
References: <1436980403-25898-1-git-send-email-pbonzini@redhat.com>
In-Reply-To: <1436980403-25898-1-git-send-email-pbonzini@redhat.com>
Subject: Re: [Qemu-devel] [PATCH] AioContext: fix broken ctx->dispatching optimization
To: Paolo Bonzini
Cc: kwolf@redhat.com, lersek@redhat.com, qemu-devel@nongnu.org, rjones@redhat.com

On Wed, Jul 15, 2015 at 07:13:23PM +0200, Paolo Bonzini wrote:
> This patch rewrites the ctx->dispatching optimization, which was the cause
> of some mysterious hangs that could be reproduced on aarch64 KVM only.
> The hangs were indirectly caused by aio_poll() and in particular by
> flash memory updates' call to blk_write(), which invokes aio_poll().
> Fun stuff: they had an extremely short race window, so much so that
> adding any kind of tracing to either the kernel or QEMU made it
> go away (a single printf made it half as reproducible).
>
> On the plus side, the failure mode (a hang until the next keypress)
> made it very easy to examine the state of the process with a debugger.
> And there was a very nice reproducer from Laszlo, which failed pretty
> often (more than half of the time) on any version of QEMU with a non-debug
> kernel; it also failed fast, while still in the firmware. So, it could
> have been worse.
>
> For some unknown reason the hangs happened only with virtio-scsi, but
> that's not important. It's more interesting that they disappeared with
> io=native, making thread-pool.c a likely suspect for where the bug arose.
> thread-pool.c is also one of the few places which use bottom halves
> across threads, by the way.
>
> I hope that no other similar bugs exist, but just in case :) I am
> going to describe how the successful debugging went... Since the
> likely culprit was the ctx->dispatching optimization, which mostly
> affects bottom halves, the first observation was that there are two
> qemu_bh_schedule() invocations in the thread pool: the one in the aio
> worker and the one in thread_pool_completion_bh. The latter always
> causes the optimization to trigger, the former may or may not. In
> order to restrict the possibilities, I introduced new functions
> qemu_bh_schedule_slow() and qemu_bh_schedule_fast():
>
>     /* qemu_bh_schedule_slow: */
>     ctx = bh->ctx;
>     bh->idle = 0;
>     if (atomic_xchg(&bh->scheduled, 1) == 0) {
>         event_notifier_set(&ctx->notifier);
>     }
>
>     /* qemu_bh_schedule_fast: */
>     ctx = bh->ctx;
>     bh->idle = 0;
>     assert(ctx->dispatching);
>     atomic_xchg(&bh->scheduled, 1);
>
> Notice how the atomic_xchg is still in qemu_bh_schedule_slow(). This
> was already debated a few months ago, so I assumed it to be correct.
> In retrospect this was a very good idea, as you'll see later.
>
> Changing thread_pool_completion_bh() to qemu_bh_schedule_fast() didn't
> trigger the assertion (as expected). Changing the worker's invocation
> to qemu_bh_schedule_slow() didn't hide the bug (another assumption
> which luckily held). This already heavily limited the amount of
> interaction between the threads, hinting that the problematic events
> must have triggered around thread_pool_completion_bh().
>
> As mentioned earlier, invoking a debugger to examine the state of a
> hung process was pretty easy; the iothread was always waiting on a
> poll(..., -1) system call. Infinite timeouts are much rarer on x86,
> and this could be the reason why the bug was never observed there.
> With the buggy sequence more or less resolved to an interaction between
> thread_pool_completion_bh() and poll(..., -1), my "tracing" strategy was
> to just add a few qemu_clock_get_ns(QEMU_CLOCK_REALTIME) calls, hoping
> that the ordering of aio_ctx_prepare(), aio_ctx_dispatch(), poll() and
> qemu_bh_schedule_fast() would provide some hint. The output was:
>
>     (gdb) p last_prepare
>     $3 = 103885451
>     (gdb) p last_dispatch
>     $4 = 103876492
>     (gdb) p last_poll
>     $5 = 115909333
>     (gdb) p last_schedule
>     $6 = 115925212
>
> Notice how the last call to qemu_poll_ns() came after aio_ctx_dispatch().
> This makes little sense unless there is an aio_poll() call involved,
> and indeed with a slightly different instrumentation you can see that
> there is one:
>
>     (gdb) p last_prepare
>     $3 = 107569679
>     (gdb) p last_dispatch
>     $4 = 107561600
>     (gdb) p last_aio_poll
>     $5 = 110671400
>     (gdb) p last_schedule
>     $6 = 110698917
>
> So the scenario becomes clearer:
>
>     iothread                          VCPU thread
>  --------------------------------------------------------------------------
>     aio_ctx_prepare
>     aio_ctx_check
>     qemu_poll_ns(timeout=-1)
>                                       aio_poll
>                                       aio_dispatch
>                                       thread_pool_completion_bh
>                                       qemu_bh_schedule()
>
> At this point bh->scheduled = 1 and the iothread has not been woken up.
> The solution must be close, but this alone should not be a problem,
> because the bottom half is only rescheduled to account for rare situations
> (see commit 3c80ca1, thread-pool: avoid deadlock in nested aio_poll()
> calls, 2014-07-15).
>
> Introducing a third thread---a thread pool worker thread, which
> also does qemu_bh_schedule()---does bring out the problematic case.
> The third thread must be awakened *after* the callback is complete and
> thread_pool_completion_bh has redone the whole loop, explaining the
> short race window. And then this is what happens:
>
>     thread pool worker
>  --------------------------------------------------------------------------
>     <I/O completes>
>     qemu_bh_schedule()
>
> Tada, bh->scheduled is already 1, so qemu_bh_schedule() does nothing
> and the iothread is never woken up. This is where the bh->scheduled
> optimization comes into play---it is correct, but removing it would
> have masked the bug.
>
> So, what is the bug?
>
> Well, the question asked by the ctx->dispatching optimization ("is any
> active aio_poll dispatching?") was wrong. The right question to ask
> instead is "is any active aio_poll *not* dispatching?", i.e. in the prepare
> or poll phases?
> In that case, the aio_poll is sleeping or might go to
> sleep anytime soon, and the EventNotifier must be invoked to wake
> it up.
>
> In any other case (including if there is *no* active aio_poll at all!)
> we can just wait for the next prepare phase to pick up the event (e.g. a
> bottom half); the prepare phase will avoid the blocking and service the
> bottom half.
>
> Expressing the invariant with a logic formula, the broken one looked like:
>
>     !(exists(thread): in_dispatching(thread)) => !optimize
>
> or equivalently:
>
>     !(exists(thread):
>       in_aio_poll(thread) && in_dispatching(thread)) => !optimize
>
> In the correct one, the negation is in a slightly different place:
>
>     (exists(thread):
>       in_aio_poll(thread) && !in_dispatching(thread)) => !optimize
>
> or equivalently:
>
>     (exists(thread): in_prepare_or_poll(thread)) => !optimize
>
> Even if the difference boils down to moving an exclamation mark :)
> the implementation is quite different. However, I think the new
> one is simpler to understand.
>
> In the old implementation, the "exists" was implemented with a boolean
> value. This didn't really support the case of multiple concurrent
> event loops well, but I thought that this was okay: aio_poll holds the
> AioContext lock so there cannot be concurrent aio_poll invocations, and
> I was just considering nested event loops. However, aio_poll _could_
> indeed be concurrent with the GSource. This is why I came up with the
> wrong invariant.
>
> In the new implementation, "exists" is computed simply by counting how many
> threads are in the prepare or poll phases. There are some interesting
> points to consider, but the gist of the idea remains:
>
> 1) AioContext can be used through GSource as well; as mentioned in the
> patch, bit 0 of the counter is reserved for the GSource.
>
> 2) the counter need not be updated for a non-blocking aio_poll, because
> it won't sleep forever anyway.
> This is just a matter of checking
> the "blocking" variable. This requires some changes to the win32
> implementation, but is otherwise not too complicated.
>
> 3) as mentioned above, the new implementation will not call aio_notify
> when there is *no* active aio_poll at all. The tests have to be
> adjusted for this change. The calls to aio_notify in async.c are fine;
> they only want to kick aio_poll out of a blocking wait, but need not
> do anything if aio_poll is not running.
>
> 4) nested aio_poll: these just work with the new implementation; when
> a nested event loop is invoked, the outer event loop is never in the
> prepare or poll phases. The outer event loop has thus already decremented
> the counter.
>
> Reported-by: Richard W. M. Jones
> Reported-by: Laszlo Ersek
> Signed-off-by: Paolo Bonzini
> ---
>  aio-posix.c             | 19 ++++++------
>  aio-win32.c             | 33 ++++++++++-----------
>  async.c                 | 21 +++++---------
>  docs/aio_notify.promela | 77 +++++++++++++++++++++----------------------
>  include/block/aio.h     | 29 +++++++++++++++----
>  tests/test-aio.c        | 26 +++--------------
>  6 files changed, 91 insertions(+), 114 deletions(-)

Looks good for QEMU 2.4:

Reviewed-by: Stefan Hajnoczi

I'll merge it for the next -rc unless there is any more discussion.