Re: [Qemu-devel] [PATCH v2] block: Fix bdrv_drain in coroutine

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Paolo Bonzini <pbonzini@redhat.com>
To: Stefan Hajnoczi <stefanha@redhat.com>, Fam Zheng <famz@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>,
	lvivier@redhat.com, qemu-devel@nongnu.org, qemu-block@nongnu.org
Subject: Re: [Qemu-devel] [PATCH v2] block: Fix bdrv_drain in coroutine
Date: Mon, 4 Apr 2016 16:47:56 +0200	[thread overview]
Message-ID: <57027E9C.9060101@redhat.com> (raw)
In-Reply-To: <20160404115734.GA10964@stefanha-x1.localdomain>



On 04/04/2016 13:57, Stefan Hajnoczi wrote:
> On Fri, Apr 01, 2016 at 09:57:38PM +0800, Fam Zheng wrote:
>> Using the nested aio_poll() in coroutine is a bad idea. This patch
>> replaces the aio_poll loop in bdrv_drain with a BH, if called in
>> coroutine.
>>
>> For example, the bdrv_drain() in mirror.c can hang when a guest issued
>> request is pending on it in qemu_co_mutex_lock().
>>
>> Mirror coroutine in this case has just finished a request, and the block
>> job is about to complete. It calls bdrv_drain() which waits for the
>> other coroutine to complete. The other coroutine is a scsi-disk request.
>> The deadlock happens when the latter is in turn pending on the former to
>> yield/terminate, in qemu_co_mutex_lock(). The state flow is as below
>> (assuming a qcow2 image):
>>
>>   mirror coroutine               scsi-disk coroutine
>>   -------------------------------------------------------------
>>   do last write
>>
>>     qcow2:qemu_co_mutex_lock()
>>     ...
>>                                  scsi disk read
>>
>>                                    tracked request begin
>>
>>                                    qcow2:qemu_co_mutex_lock.enter
>>
>>     qcow2:qemu_co_mutex_unlock()
>>
>>   bdrv_drain
>>     while (has tracked request)
>>       aio_poll()
>>
>> In the scsi-disk coroutine, the qemu_co_mutex_lock() will never return
>> because the mirror coroutine is blocked in the aio_poll(blocking=true).
>>
>> With this patch, the added qemu_coroutine_yield() allows the scsi-disk
>> coroutine to make progress as expected:
>>
>>   mirror coroutine               scsi-disk coroutine
>>   -------------------------------------------------------------
>>   do last write
>>
>>     qcow2:qemu_co_mutex_lock()
>>     ...
>>                                  scsi disk read
>>
>>                                    tracked request begin
>>
>>                                    qcow2:qemu_co_mutex_lock.enter
>>
>>     qcow2:qemu_co_mutex_unlock()
>>
>>   bdrv_drain.enter
>>>   schedule BH
>>>   qemu_coroutine_yield()
>>>                                  qcow2:qemu_co_mutex_lock.return
>>>                                  ...
>>                                    tracked request end
>>     ...
>>     (resumed from BH callback)
>>   bdrv_drain.return
>>   ...
>>
>> Reported-by: Laurent Vivier <lvivier@redhat.com>
>> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Fam Zheng <famz@redhat.com>
>>
>> ---
>>
>> v2: Call qemu_bh_delete() in BH callback. [Paolo]
>>     Change the loop to an assertion. [Paolo]
>>     Elaborate a bit about the fix in commit log. [Paolo]
>> ---
>>  block/io.c | 41 +++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 41 insertions(+)
>>
>> diff --git a/block/io.c b/block/io.c
>> index c4869b9..ddcfb4c 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -253,6 +253,43 @@ static void bdrv_drain_recurse(BlockDriverState *bs)
>>      }
>>  }
>>  
>> +typedef struct {
>> +    Coroutine *co;
>> +    BlockDriverState *bs;
>> +    QEMUBH *bh;
>> +    bool done;
>> +} BdrvCoDrainData;
>> +
>> +static void bdrv_co_drain_bh_cb(void *opaque)
>> +{
>> +    BdrvCoDrainData *data = opaque;
>> +    Coroutine *co = data->co;
>> +
>> +    qemu_bh_delete(data->bh);
>> +    bdrv_drain(data->bs);
>> +    data->done = true;
>> +    qemu_coroutine_enter(co, NULL);
>> +}
>> +
>> +static void coroutine_fn bdrv_co_drain(BlockDriverState *bs)
>> +{
> 
> Please document why a BH is needed:
> 
> /* Calling bdrv_drain() from a BH ensures the
>  * current coroutine yields and other coroutines run if they were
>  * queued from qemu_co_queue_run_restart().
>  */
> 
>> +    BdrvCoDrainData data;
>> +
>> +    assert(qemu_in_coroutine());
>> +    data = (BdrvCoDrainData) {
>> +        .co = qemu_coroutine_self(),
>> +        .bs = bs,
>> +        .done = false,
>> +        .bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_drain_bh_cb, &data),
>> +    };
>> +    qemu_bh_schedule(data.bh);
>> +
>> +    qemu_coroutine_yield();
>> +    /* If we are resumed from some other event (such as an aio completion or a
>> +     * timer callback), it is a bug in the caller that should be fixed. */
>> +    assert(data.done);
>> +}
>> +
>>  /*
>>   * Wait for pending requests to complete on a single BlockDriverState subtree,
>>   * and suspend block driver's internal I/O until next request arrives.
>> @@ -269,6 +306,10 @@ void bdrv_drain(BlockDriverState *bs)
>>      bool busy = true;
>>  
>>      bdrv_drain_recurse(bs);
>> +    if (qemu_in_coroutine()) {
>> +        bdrv_co_drain(bs);
>> +        return;
>> +    }
>>      while (busy) {
>>          /* Keep iterating */
>>           bdrv_flush_io_queue(bs);
>> -- 
>> 2.7.4
> 
> block/mirror.c should call bdrv_co_drain() explicitly and bdrv_drain()
> should assert(!qemu_in_coroutine()).
> 
> The reason why existing bdrv_read() and friends detect coroutine context
> at runtime is because it is meant for legacy code that runs in both
> coroutine and non-coroutine contexts.
> 
> Modern coroutine code coroutine code the coroutine interfaces explicitly
> instead.

For what it's worth, I suspect Fam's patch removes the need for
http://article.gmane.org/gmane.comp.emulators.qemu/401375.  That's a
nice bonus. :)

Paolo

next prev parent reply	other threads:[~2016-04-04 14:48 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-04-01 13:57 [Qemu-devel] [PATCH v2] block: Fix bdrv_drain in coroutine Fam Zheng
2016-04-01 14:14 ` Laurent Vivier
2016-04-04 11:57 ` Stefan Hajnoczi
2016-04-04 14:47   ` Paolo Bonzini [this message]
2016-04-05  1:27   ` Fam Zheng
2016-04-05  9:39     ` Stefan Hajnoczi
2016-04-05 11:15       ` Fam Zheng
2016-04-05 12:39         ` Paolo Bonzini

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=57027E9C.9060101@redhat.com \
    --to=pbonzini@redhat.com \
    --cc=famz@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=lvivier@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.