From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 5 Apr 2016 09:27:39 +0800
From: Fam Zheng
Message-ID: <20160405012739.GA670@ad.usersys.redhat.com>
References: <1459519058-29864-1-git-send-email-famz@redhat.com> <20160404115734.GA10964@stefanha-x1.localdomain>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160404115734.GA10964@stefanha-x1.localdomain>
Subject: Re: [Qemu-devel] [PATCH v2] block: Fix bdrv_drain in coroutine
To: Stefan Hajnoczi
Cc: Kevin Wolf, lvivier@redhat.com, qemu-devel@nongnu.org, qemu-block@nongnu.org, pbonzini@redhat.com

On Mon, 04/04 12:57, Stefan Hajnoczi wrote:
> On Fri, Apr 01, 2016 at 09:57:38PM +0800, Fam Zheng wrote:
> > Using the nested aio_poll() in coroutine is a bad idea. This patch
> > replaces the aio_poll loop in bdrv_drain with a BH, if called in
> > coroutine.
> >
> > For example, the bdrv_drain() in mirror.c can hang when a guest issued
> > request is pending on it in qemu_co_mutex_lock().
> >
> > Mirror coroutine in this case has just finished a request, and the block
> > job is about to complete. It calls bdrv_drain() which waits for the
> > other coroutine to complete. The other coroutine is a scsi-disk request.
> > The deadlock happens when the latter is in turn pending on the former to
> > yield/terminate, in qemu_co_mutex_lock(). The state flow is as below
> > (assuming a qcow2 image):
> >
> >   mirror coroutine                     scsi-disk coroutine
> >   -------------------------------------------------------------
> >   do last write
> >
> >     qcow2:qemu_co_mutex_lock()
> >     ...
> >                                        scsi disk read
> >
> >                                          tracked request begin
> >
> >                                          qcow2:qemu_co_mutex_lock.enter
> >
> >   qcow2:qemu_co_mutex_unlock()
> >
> >   bdrv_drain
> >     while (has tracked request)
> >       aio_poll()
> >
> > In the scsi-disk coroutine, the qemu_co_mutex_lock() will never return
> > because the mirror coroutine is blocked in the aio_poll(blocking=true).
> >
> > With this patch, the added qemu_coroutine_yield() allows the scsi-disk
> > coroutine to make progress as expected:
> >
> >   mirror coroutine                     scsi-disk coroutine
> >   -------------------------------------------------------------
> >   do last write
> >
> >     qcow2:qemu_co_mutex_lock()
> >     ...
> >                                        scsi disk read
> >
> >                                          tracked request begin
> >
> >                                          qcow2:qemu_co_mutex_lock.enter
> >
> >   qcow2:qemu_co_mutex_unlock()
> >
> >   bdrv_drain.enter
> >   > schedule BH
> >   > qemu_coroutine_yield()
> >   >                                      qcow2:qemu_co_mutex_lock.return
> >   >                                      ...
> >                                        tracked request end
> >     ...
> >     (resumed from BH callback)
> >   bdrv_drain.return
> >   ...
> >
> > Reported-by: Laurent Vivier
> > Suggested-by: Paolo Bonzini
> > Signed-off-by: Fam Zheng
> >
> > ---
> >
> > v2: Call qemu_bh_delete() in BH callback. [Paolo]
> >     Change the loop to an assertion. [Paolo]
> >     Elaborate a bit about the fix in commit log. [Paolo]
> > ---
> >  block/io.c | 41 +++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 41 insertions(+)
> >
> > diff --git a/block/io.c b/block/io.c
> > index c4869b9..ddcfb4c 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -253,6 +253,43 @@ static void bdrv_drain_recurse(BlockDriverState *bs)
> >      }
> >  }
> >  
> > +typedef struct {
> > +    Coroutine *co;
> > +    BlockDriverState *bs;
> > +    QEMUBH *bh;
> > +    bool done;
> > +} BdrvCoDrainData;
> > +
> > +static void bdrv_co_drain_bh_cb(void *opaque)
> > +{
> > +    BdrvCoDrainData *data = opaque;
> > +    Coroutine *co = data->co;
> > +
> > +    qemu_bh_delete(data->bh);
> > +    bdrv_drain(data->bs);
> > +    data->done = true;
> > +    qemu_coroutine_enter(co, NULL);
> > +}
> > +
> > +static void coroutine_fn bdrv_co_drain(BlockDriverState *bs)
> > +{
> 
> Please document why a BH is needed:
> 
> /* Calling bdrv_drain() from a BH ensures the
>  * current coroutine yields and other coroutines run if they were
>  * queued from qemu_co_queue_run_restart().
>  */

OK.

> 
> > +    BdrvCoDrainData data;
> > +
> > +    assert(qemu_in_coroutine());
> > +    data = (BdrvCoDrainData) {
> > +        .co = qemu_coroutine_self(),
> > +        .bs = bs,
> > +        .done = false,
> > +        .bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_drain_bh_cb, &data),
> > +    };
> > +    qemu_bh_schedule(data.bh);
> > +
> > +    qemu_coroutine_yield();
> > +    /* If we are resumed from some other event (such as an aio completion or a
> > +     * timer callback), it is a bug in the caller that should be fixed. */
> > +    assert(data.done);
> > +}
> > +
> >  /*
> >   * Wait for pending requests to complete on a single BlockDriverState subtree,
> >   * and suspend block driver's internal I/O until next request arrives.
> > @@ -269,6 +306,10 @@ void bdrv_drain(BlockDriverState *bs)
> >      bool busy = true;
> >  
> >      bdrv_drain_recurse(bs);
> > +    if (qemu_in_coroutine()) {
> > +        bdrv_co_drain(bs);
> > +        return;
> > +    }
> >      while (busy) {
> >          /* Keep iterating */
> >          bdrv_flush_io_queue(bs);
> > -- 
> > 2.7.4
> 
> block/mirror.c should call bdrv_co_drain() explicitly and bdrv_drain()
> should assert(!qemu_in_coroutine()).
> 
> The reason why existing bdrv_read() and friends detect coroutine context
> at runtime is because it is meant for legacy code that runs in both
> coroutine and non-coroutine contexts.

blk_unref/bdrv_unref are called in coroutine context (.bdrv_create
implementations), and that won't work with the assertion. Should I clean up
this "legacy" code first, i.e. move bdrv_unref calls to BHs in the callers and
assert(!qemu_in_coroutine()) there too? I didn't do this because it complicates
the code somewhat.

> 
> Modern coroutine code calls the coroutine interfaces explicitly
> instead.
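
To make sure I understand the suggestion, a rough sketch of what that would
look like on top of this patch (untested, and assuming bdrv_co_drain() gets a
prototype in include/block/block.h so block/mirror.c can call it directly):

    /* block/io.c: bdrv_drain() becomes non-coroutine only */
    void bdrv_drain(BlockDriverState *bs)
    {
        bool busy = true;

        /* Coroutine callers must use bdrv_co_drain() instead. */
        assert(!qemu_in_coroutine());

        bdrv_drain_recurse(bs);
        while (busy) {
            /* Keep iterating */
            bdrv_flush_io_queue(bs);
            /* ... rest of the existing loop unchanged ... */
        }
    }

    /* block/mirror.c: mirror_run() is a coroutine_fn, so its drain call
     * becomes an explicit bdrv_co_drain(bs) instead of going through the
     * runtime check in bdrv_drain(). */

Is that what you have in mind?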