From: Stefan Hajnoczi
Date: Tue, 5 Aug 2014 14:43:30 +0100
Message-ID: <20140805134330.GC12251@stefanha-thinkpad.redhat.com>
References: <1407167793-20425-1-git-send-email-stefanha@redhat.com>
Subject: Re: [Qemu-devel] [PATCH] linux-aio: avoid deadlock in nested aio_poll() calls
To: Ming Lei
Cc: Kevin Wolf, Paolo Bonzini, Marcin Gibuła, qemu-devel

On Tue, Aug 05, 2014 at 06:44:25PM +0800, Ming Lei wrote:
> On Mon, Aug 4, 2014 at 11:56 PM, Stefan Hajnoczi wrote:
> > If two Linux AIO request completions are fetched in the same
> > io_getevents() call, QEMU will deadlock if request A's callback waits
> > for request B to complete using an aio_poll() loop.  This was reported
> > to happen with the mirror blockjob.
> >
> > This patch moves completion processing into a BH and makes it resumable.
> > Nested event loops can resume completion processing so that request B
> > will complete and the deadlock will not occur.
> >
> > Cc: Kevin Wolf
> > Cc: Paolo Bonzini
> > Cc: Ming Lei
> > Cc: Marcin Gibuła
> > Reported-by: Marcin Gibuła
> > Signed-off-by: Stefan Hajnoczi
> > ---
> >  block/linux-aio.c | 71 ++++++++++++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 55 insertions(+), 16 deletions(-)
> >
> > diff --git a/block/linux-aio.c b/block/linux-aio.c
> > index 7ac7e8c..9aca758 100644
> > --- a/block/linux-aio.c
> > +++ b/block/linux-aio.c
> > @@ -51,6 +51,12 @@ struct qemu_laio_state {
> >
> >      /* io queue for submit at batch */
> >      LaioQueue io_q;
> > +
> > +    /* I/O completion processing */
> > +    QEMUBH *completion_bh;
> > +    struct io_event events[MAX_EVENTS];
> > +    int event_idx;
> > +    int event_max;
> >  };
> >
> >  static inline ssize_t io_event_ret(struct io_event *ev)
> > @@ -86,27 +92,58 @@ static void qemu_laio_process_completion(struct qemu_laio_state *s,
> >      qemu_aio_release(laiocb);
> >  }
> >
> > -static void qemu_laio_completion_cb(EventNotifier *e)
> > +/* The completion BH fetches completed I/O requests and invokes their
> > + * callbacks.
> > + *
> > + * The function is somewhat tricky because it supports nested event loops, for
> > + * example when a request callback invokes aio_poll().  In order to do this,
>
> It looks like a very tricky usage; maybe it is better to change the caller.

This comment is not about usage.  It's just for people reading the
implementation.  I can move it inside the function body, if you like.

I like the idea of eliminating nested event loops, but it requires a huge
change: making all callers either async (using callbacks) or coroutines so
they can yield.  There are many callers, so this is a lot of work and will
have side effects too.

BTW, here is the thread-pool.c fix which is analogous to this patch:
https://lists.nongnu.org/archive/html/qemu-devel/2014-07/msg02437.html
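To make the scenario from the commit message concrete -- request A's callback
entering a nested event loop while request B's completion is still pending --
here is a minimal, self-contained sketch of the resumable processing the patch
uses.  All names (nested_poll, fetch_events, request_cb) are invented stand-ins
for aio_poll(), io_getevents() and the request callbacks; a plain int array
stands in for struct io_event[].  This is an illustration, not QEMU code:

/*
 * Editor's sketch, not QEMU code: completion state lives in a struct shared
 * with nested event loops, so a nested poll can resume processing instead of
 * deadlocking on completions stuck in a local array.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_EVENTS 4

struct state {
    int events[MAX_EVENTS];   /* batch of fetched completions */
    int event_idx;            /* next completion to process */
    int event_max;            /* number of fetched completions */
    bool bh_scheduled;        /* stand-in for qemu_bh_schedule() */
    bool fetched;             /* pretend the kernel delivers only one batch */
};

static void completion_bh(struct state *s);

/* Stand-in for a nested aio_poll(): run the completion BH if scheduled. */
static void nested_poll(struct state *s)
{
    if (s->bh_scheduled) {
        s->bh_scheduled = false;
        completion_bh(s);
    }
}

/* Stand-in for io_getevents(): one batch of two completions, then nothing. */
static int fetch_events(struct state *s)
{
    if (s->fetched) {
        return 0;
    }
    s->fetched = true;
    s->events[0] = 0;         /* request A */
    s->events[1] = 1;         /* request B */
    return 2;
}

static void request_cb(struct state *s, int req)
{
    printf("request %d completed\n", req);
    if (req == 0) {
        /* Request A waits for request B.  With the old code, B's completion
         * sat in a local array of the outer handler, so a nested poll could
         * never deliver it.  Here the index and array are shared state, so
         * the nested poll resumes processing and B completes. */
        nested_poll(s);
    }
}

static void completion_bh(struct state *s)
{
    /* Fetch more completion events only when the current batch is done. */
    if (s->event_idx == s->event_max) {
        s->event_max = fetch_events(s);
        s->event_idx = 0;
        if (s->event_max <= 0) {
            return;           /* no more events */
        }
    }

    /* Reschedule first so a nested event loop sees pending completions. */
    s->bh_scheduled = true;

    /* Process completions; a nested call advances the shared index. */
    while (s->event_idx < s->event_max) {
        int req = s->events[s->event_idx++];
        request_cb(s, req);
    }
}

int main(void)
{
    struct state s = {0};
    completion_bh(&s);   /* prints "request 0 completed", then "request 1 completed" */
    return 0;
}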
> > + * the completion events array and index are kept in qemu_laio_state.  The BH
> > + * reschedules itself as long as there are completions pending so it will
> > + * either be called again in a nested event loop or will be called after all
> > + * events have been completed.  When there are no events left to complete, the
> > + * BH returns without rescheduling.
> > + */
> > +static void qemu_laio_completion_bh(void *opaque)
> >  {
> > -    struct qemu_laio_state *s = container_of(e, struct qemu_laio_state, e);
> > -
> > -    while (event_notifier_test_and_clear(&s->e)) {
> > -        struct io_event events[MAX_EVENTS];
> > -        struct timespec ts = { 0 };
> > -        int nevents, i;
> > +    struct qemu_laio_state *s = opaque;
> >
> > +    /* Fetch more completion events when empty */
> > +    if (s->event_idx == s->event_max) {
> >          do {
> > -            nevents = io_getevents(s->ctx, MAX_EVENTS, MAX_EVENTS, events, &ts);
> > -        } while (nevents == -EINTR);
> > +            struct timespec ts = { 0 };
> > +            s->event_max = io_getevents(s->ctx, MAX_EVENTS, MAX_EVENTS,
> > +                                        s->events, &ts);
> > +        } while (s->event_max == -EINTR);
> > +
> > +        s->event_idx = 0;
> > +        if (s->event_max <= 0) {
> > +            s->event_max = 0;
> > +            return; /* no more events */
> > +        }
> > +    }
> >
> > -        for (i = 0; i < nevents; i++) {
> > -            struct iocb *iocb = events[i].obj;
> > -            struct qemu_laiocb *laiocb =
> > -                    container_of(iocb, struct qemu_laiocb, iocb);
> > +    /* Reschedule so nested event loops see currently pending completions */
> > +    qemu_bh_schedule(s->completion_bh);
> >
> > -            laiocb->ret = io_event_ret(&events[i]);
> > -            qemu_laio_process_completion(s, laiocb);
> > -        }
> > +    /* Process completion events */
> > +    while (s->event_idx < s->event_max) {
> > +        struct iocb *iocb = s->events[s->event_idx].obj;
> > +        struct qemu_laiocb *laiocb =
> > +                container_of(iocb, struct qemu_laiocb, iocb);
> > +
> > +        laiocb->ret = io_event_ret(&s->events[s->event_idx]);
> > +        s->event_idx++;
> > +
> > +        qemu_laio_process_completion(s, laiocb);
>
> The implementation is as tricky as the usage, :-)
>
> Also, using a FIFO-style implementation should be more efficient, since
> I/O events can still be read and completed in the current BH handler if
> the queue isn't full, but it becomes more complicated.

That might help but should be benchmarked.  Another trick is calling
qemu_laio_completion_bh() directly from qemu_laio_completion_cb() to
avoid a BH iteration.

I think they are premature optimizations.  Let's first agree whether this
fix is correct or not :).

Stefan
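For readers following the thread, the direct-call trick Stefan mentions would
presumably look something like the sketch below.  It is written against the
structures shown in the quoted patch, is untested, and is not part of the
posted patch:

/* Sketch only: invoke the completion BH function directly from the event
 * notifier callback, skipping one qemu_bh_schedule()/BH dispatch round trip.
 * Uses only the fields and helpers visible in the quoted code above. */
static void qemu_laio_completion_cb(EventNotifier *e)
{
    struct qemu_laio_state *s = container_of(e, struct qemu_laio_state, e);

    if (event_notifier_test_and_clear(&s->e)) {
        qemu_laio_completion_bh(s);
    }
}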