All of lore.kernel.org
 help / color / mirror / Atom feed
From: Kevin Wolf <kwolf@redhat.com>
To: Fam Zheng <famz@redhat.com>
Cc: qemu-devel@nongnu.org, qemu-block@nongnu.org,
	Max Reitz <mreitz@redhat.com>,
	pbonzini@redhat.com, Stefan Hajnoczi <stefanha@redhat.com>,
	jcody@redhat.com
Subject: Re: [Qemu-devel] [PATCH for-2.9-rc5 v4 2/2] block: Drain BH in bdrv_drained_begin
Date: Tue, 6 Feb 2018 16:32:37 +0100	[thread overview]
Message-ID: <20180206153237.GA4799@localhost.localdomain> (raw)
In-Reply-To: <20170418143044.12187-3-famz@redhat.com>

Am 18.04.2017 um 16:30 hat Fam Zheng geschrieben:
> During block job completion, nothing is preventing
> block_job_defer_to_main_loop_bh from being called in a nested
> aio_poll(), which is a trouble, such as in this code path:
> 
>     qmp_block_commit
>       commit_active_start
>         bdrv_reopen
>           bdrv_reopen_multiple
>             bdrv_reopen_prepare
>               bdrv_flush
>                 aio_poll
>                   aio_bh_poll
>                     aio_bh_call
>                       block_job_defer_to_main_loop_bh
>                         stream_complete
>                           bdrv_reopen
> 
> block_job_defer_to_main_loop_bh is the last step of the stream job,
> which should have been "paused" by the bdrv_drained_begin/end in
> bdrv_reopen_multiple, but it is not done because it's in the form of a
> main loop BH.
> 
> Similar to why block jobs should be paused between drained_begin and
> drained_end, BHs they schedule must be excluded as well.  To achieve
> this, this patch forces draining the BH in BDRV_POLL_WHILE.
> 
> As a side effect this fixes a hang in block_job_detach_aio_context
> during system_reset when a block job is ready:
> 
>     #0  0x0000555555aa79f3 in bdrv_drain_recurse
>     #1  0x0000555555aa825d in bdrv_drained_begin
>     #2  0x0000555555aa8449 in bdrv_drain
>     #3  0x0000555555a9c356 in blk_drain
>     #4  0x0000555555aa3cfd in mirror_drain
>     #5  0x0000555555a66e11 in block_job_detach_aio_context
>     #6  0x0000555555a62f4d in bdrv_detach_aio_context
>     #7  0x0000555555a63116 in bdrv_set_aio_context
>     #8  0x0000555555a9d326 in blk_set_aio_context
>     #9  0x00005555557e38da in virtio_blk_data_plane_stop
>     #10 0x00005555559f9d5f in virtio_bus_stop_ioeventfd
>     #11 0x00005555559fa49b in virtio_bus_stop_ioeventfd
>     #12 0x00005555559f6a18 in virtio_pci_stop_ioeventfd
>     #13 0x00005555559f6a18 in virtio_pci_reset
>     #14 0x00005555559139a9 in qdev_reset_one
>     #15 0x0000555555916738 in qbus_walk_children
>     #16 0x0000555555913318 in qdev_walk_children
>     #17 0x0000555555916738 in qbus_walk_children
>     #18 0x00005555559168ca in qemu_devices_reset
>     #19 0x000055555581fcbb in pc_machine_reset
>     #20 0x00005555558a4d96 in qemu_system_reset
>     #21 0x000055555577157a in main_loop_should_exit
>     #22 0x000055555577157a in main_loop
>     #23 0x000055555577157a in main
> 
> The rationale is that the loop in block_job_detach_aio_context cannot
> make any progress in pausing/completing the job, because bs->in_flight
> is 0, so bdrv_drain doesn't process the block_job_defer_to_main_loop
> BH. With this patch, it does.
> 
> Reported-by: Jeff Cody <jcody@redhat.com>
> Signed-off-by: Fam Zheng <famz@redhat.com>

Fam, do you remember whether this was really only about drain? Because
in that case...

> diff --git a/include/block/block.h b/include/block/block.h
> index 97d4330..5ddc0cf 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -381,12 +381,13 @@ void bdrv_drain_all(void);
>  
>  #define BDRV_POLL_WHILE(bs, cond) ({                       \
>      bool waited_ = false;                                  \
> +    bool busy_ = true;                                     \
>      BlockDriverState *bs_ = (bs);                          \
>      AioContext *ctx_ = bdrv_get_aio_context(bs_);          \
>      if (aio_context_in_iothread(ctx_)) {                   \
> -        while ((cond)) {                                   \
> -            aio_poll(ctx_, true);                          \
> -            waited_ = true;                                \
> +        while ((cond) || busy_) {                          \
> +            busy_ = aio_poll(ctx_, (cond));                \
> +            waited_ |= !!(cond) | busy_;                   \
>          }                                                  \
>      } else {                                               \
>          assert(qemu_get_current_aio_context() ==           \
> @@ -398,11 +399,16 @@ void bdrv_drain_all(void);
>           */                                                \
>          assert(!bs_->wakeup);                              \
>          bs_->wakeup = true;                                \
> -        while ((cond)) {                                   \
> -            aio_context_release(ctx_);                     \
> -            aio_poll(qemu_get_aio_context(), true);        \
> -            aio_context_acquire(ctx_);                     \
> -            waited_ = true;                                \
> +        while (busy_) {                                    \
> +            if ((cond)) {                                  \
> +                waited_ = busy_ = true;                    \
> +                aio_context_release(ctx_);                 \
> +                aio_poll(qemu_get_aio_context(), true);    \
> +                aio_context_acquire(ctx_);                 \
> +            } else {                                       \
> +                busy_ = aio_poll(ctx_, false);             \
> +                waited_ |= busy_;                          \
> +            }                                              \
>          }                                                  \
>          bs_->wakeup = false;                               \
>      }                                                      \

...fixing it in BDRV_POLL_WHILE() rather than the drain functions may
have been a bad idea.

I got a bug assigned where we have a large (200 GB) fragmented qcow2
image, and qemu-img convert takes two hours before it even starts to
copy data. What it does in this time is iterating through the whole
image with bdrv_block_status_above() in order to estimate the work to be
done (and of course it will repeat the same calls while actually copying
data).

One of the things I saw is that between each pair of lseek() calls, we
also have unnecessary poll() calls, and these were introduced by this
patch. If I modify bdrv_common_block_status_above() so that it doesn't
poll if data.done == true already, the time needed to iterate through my
test image is cut in half.

Now, of course, I'm still only seeing a few seconds for a 100 GB image,
so there must be more that is wrong for the reporter, but it suggests to
me that BDRV_POLL_WHILE() shouldn't be polling unconditionally when only
one of its users actually needs this.

Kevin

  reply	other threads:[~2018-02-06 15:33 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-18 14:30 [Qemu-devel] [PATCH for-2.9-rc5 v4 0/2] block: Drain BH in bdrv_drained_begin Fam Zheng
2017-04-18 14:30 ` [Qemu-devel] [PATCH for-2.9-rc5 v4 1/2] block: Walk bs->children carefully in bdrv_drain_recurse Fam Zheng
2017-04-18 14:46   ` Kevin Wolf
2017-04-18 14:54     ` Fam Zheng
2017-04-18 14:30 ` [Qemu-devel] [PATCH for-2.9-rc5 v4 2/2] block: Drain BH in bdrv_drained_begin Fam Zheng
2018-02-06 15:32   ` Kevin Wolf [this message]
2018-02-07  1:48     ` Fam Zheng
2018-02-07 10:51       ` Kevin Wolf
2018-02-07 12:39         ` Fam Zheng
2018-02-07 13:10           ` Kevin Wolf
2018-02-07 14:14             ` Fam Zheng
2017-04-18 14:42 ` [Qemu-devel] [PATCH for-2.9-rc5 v4 0/2] " Paolo Bonzini
2017-04-18 14:51 ` Jeff Cody

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180206153237.GA4799@localhost.localdomain \
    --to=kwolf@redhat.com \
    --cc=famz@redhat.com \
    --cc=jcody@redhat.com \
    --cc=mreitz@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.