References: <1459519058-29864-1-git-send-email-famz@redhat.com> <20160404115734.GA10964@stefanha-x1.localdomain>
From: Paolo Bonzini
Message-ID: <57027E9C.9060101@redhat.com>
Date: Mon, 4 Apr 2016 16:47:56 +0200
In-Reply-To: <20160404115734.GA10964@stefanha-x1.localdomain>
Subject: Re: [Qemu-devel] [PATCH v2] block: Fix bdrv_drain in coroutine
To: Stefan Hajnoczi, Fam Zheng
Cc: Kevin Wolf, lvivier@redhat.com, qemu-devel@nongnu.org, qemu-block@nongnu.org

On 04/04/2016 13:57, Stefan Hajnoczi wrote:
> On Fri, Apr 01, 2016 at 09:57:38PM +0800, Fam Zheng wrote:
>> Using the nested aio_poll() in coroutine is a bad idea. This patch
>> replaces the aio_poll loop in bdrv_drain with a BH, if called in
>> coroutine.
>>
>> For example, the bdrv_drain() in mirror.c can hang when a guest issued
>> request is pending on it in qemu_co_mutex_lock().
>>
>> Mirror coroutine in this case has just finished a request, and the block
>> job is about to complete. It calls bdrv_drain() which waits for the
>> other coroutine to complete. The other coroutine is a scsi-disk request.
>> The deadlock happens when the latter is in turn pending on the former to
>> yield/terminate, in qemu_co_mutex_lock(). The state flow is as below
>> (assuming a qcow2 image):
>>
>>   mirror coroutine                          scsi-disk coroutine
>>   -------------------------------------------------------------
>>   do last write
>>
>>     qcow2:qemu_co_mutex_lock()
>>     ...
>>                                             scsi disk read
>>
>>                                               tracked request begin
>>
>>                                               qcow2:qemu_co_mutex_lock.enter
>>
>>     qcow2:qemu_co_mutex_unlock()
>>
>>   bdrv_drain
>>     while (has tracked request)
>>       aio_poll()
>>
>> In the scsi-disk coroutine, the qemu_co_mutex_lock() will never return
>> because the mirror coroutine is blocked in the aio_poll(blocking=true).
>>
>> With this patch, the added qemu_coroutine_yield() allows the scsi-disk
>> coroutine to make progress as expected:
>>
>>   mirror coroutine                          scsi-disk coroutine
>>   -------------------------------------------------------------
>>   do last write
>>
>>     qcow2:qemu_co_mutex_lock()
>>     ...
>>                                             scsi disk read
>>
>>                                               tracked request begin
>>
>>                                               qcow2:qemu_co_mutex_lock.enter
>>
>>     qcow2:qemu_co_mutex_unlock()
>>
>>   bdrv_drain.enter
>>> schedule BH
>>> qemu_coroutine_yield()
>>>                                           qcow2:qemu_co_mutex_lock.return
>>>                                           ...
>>                                             tracked request end
>>   ...
>>   (resumed from BH callback)
>>   bdrv_drain.return
>>   ...
>>
>> Reported-by: Laurent Vivier
>> Suggested-by: Paolo Bonzini
>> Signed-off-by: Fam Zheng
>>
>> ---
>>
>> v2: Call qemu_bh_delete() in BH callback. [Paolo]
>>     Change the loop to an assertion. [Paolo]
>>     Elaborate a bit about the fix in commit log. [Paolo]
>> ---
>>  block/io.c | 41 +++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 41 insertions(+)
>>
>> diff --git a/block/io.c b/block/io.c
>> index c4869b9..ddcfb4c 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -253,6 +253,43 @@ static void bdrv_drain_recurse(BlockDriverState *bs)
>>      }
>>  }
>>  
>> +typedef struct {
>> +    Coroutine *co;
>> +    BlockDriverState *bs;
>> +    QEMUBH *bh;
>> +    bool done;
>> +} BdrvCoDrainData;
>> +
>> +static void bdrv_co_drain_bh_cb(void *opaque)
>> +{
>> +    BdrvCoDrainData *data = opaque;
>> +    Coroutine *co = data->co;
>> +
>> +    qemu_bh_delete(data->bh);
>> +    bdrv_drain(data->bs);
>> +    data->done = true;
>> +    qemu_coroutine_enter(co, NULL);
>> +}
>> +
>> +static void coroutine_fn bdrv_co_drain(BlockDriverState *bs)
>> +{
>
> Please document why a BH is needed:
>
>   /* Calling bdrv_drain() from a BH ensures the
>    * current coroutine yields and other coroutines run if they were
>    * queued from qemu_co_queue_run_restart().
>    */
>
>> +    BdrvCoDrainData data;
>> +
>> +    assert(qemu_in_coroutine());
>> +    data = (BdrvCoDrainData) {
>> +        .co = qemu_coroutine_self(),
>> +        .bs = bs,
>> +        .done = false,
>> +        .bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_drain_bh_cb, &data),
>> +    };
>> +    qemu_bh_schedule(data.bh);
>> +
>> +    qemu_coroutine_yield();
>> +    /* If we are resumed from some other event (such as an aio completion or a
>> +     * timer callback), it is a bug in the caller that should be fixed. */
>> +    assert(data.done);
>> +}
>> +
>>  /*
>>   * Wait for pending requests to complete on a single BlockDriverState subtree,
>>   * and suspend block driver's internal I/O until next request arrives.
>> @@ -269,6 +306,10 @@ void bdrv_drain(BlockDriverState *bs)
>>      bool busy = true;
>>  
>>      bdrv_drain_recurse(bs);
>> +    if (qemu_in_coroutine()) {
>> +        bdrv_co_drain(bs);
>> +        return;
>> +    }
>>      while (busy) {
>>          /* Keep iterating */
>>          bdrv_flush_io_queue(bs);
>> -- 
>> 2.7.4
>
> block/mirror.c should call bdrv_co_drain() explicitly and bdrv_drain()
> should assert(!qemu_in_coroutine()).
>
> The reason why existing bdrv_read() and friends detect coroutine context
> at runtime is because it is meant for legacy code that runs in both
> coroutine and non-coroutine contexts.
>
> Modern coroutine code uses the coroutine interfaces explicitly
> instead.

For what it's worth, I suspect Fam's patch removes the need for
http://article.gmane.org/gmane.comp.emulators.qemu/401375.  That's a
nice bonus. :)

Paolo
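
A rough sketch of the follow-up Stefan suggests, assuming bdrv_co_drain()
from the patch above is made non-static and declared in a public header;
the elided polling loop and the exact call site in block/mirror.c are
placeholders, not the actual QEMU code:

  /* block/io.c: bdrv_drain() keeps only the non-coroutine path and
   * rejects coroutine callers outright. */
  void bdrv_drain(BlockDriverState *bs)
  {
      /* Coroutine callers must use bdrv_co_drain() explicitly. */
      assert(!qemu_in_coroutine());

      bdrv_drain_recurse(bs);
      /* ... existing bdrv_flush_io_queue()/aio_poll() loop, unchanged ... */
  }

  /* block/mirror.c: inside the mirror coroutine (a coroutine_fn), the
   * coroutine variant is called directly instead of bdrv_drain(bs). */
  bdrv_co_drain(bs);

With that split, bdrv_drain() no longer detects its calling context at
runtime, matching the convention Stefan describes for modern coroutine
code.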