qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [PATCH for-2.9-rc5 v4 0/2] block: Drain BH in bdrv_drained_begin
@ 2017-04-18 14:30 Fam Zheng
  2017-04-18 14:30 ` [Qemu-devel] [PATCH for-2.9-rc5 v4 1/2] block: Walk bs->children carefully in bdrv_drain_recurse Fam Zheng
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Fam Zheng @ 2017-04-18 14:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Fam Zheng, Kevin Wolf, Max Reitz, pbonzini,
	Stefan Hajnoczi, jcody

v4: Split patch, and fix the unsafe bdrv_unref. [Paolo]

Fam Zheng (2):
  block: Walk bs->children carefully in bdrv_drain_recurse
  block: Drain BH in bdrv_drained_begin

 block/io.c            | 23 ++++++++++++++++++++---
 include/block/block.h | 22 ++++++++++++++--------
 2 files changed, 34 insertions(+), 11 deletions(-)

-- 
2.9.3


* [Qemu-devel] [PATCH for-2.9-rc5 v4 1/2] block: Walk bs->children carefully in bdrv_drain_recurse
  2017-04-18 14:30 [Qemu-devel] [PATCH for-2.9-rc5 v4 0/2] block: Drain BH in bdrv_drained_begin Fam Zheng
@ 2017-04-18 14:30 ` Fam Zheng
  2017-04-18 14:46   ` Kevin Wolf
  2017-04-18 14:30 ` [Qemu-devel] [PATCH for-2.9-rc5 v4 2/2] block: Drain BH in bdrv_drained_begin Fam Zheng
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Fam Zheng @ 2017-04-18 14:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Fam Zheng, Kevin Wolf, Max Reitz, pbonzini,
	Stefan Hajnoczi, jcody

The recursive bdrv_drain_recurse may run a block job completion BH that
drops nodes. The coming changes will make that more likely, and a
use-after-free would happen without this patch.
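
As a rough illustration (not part of the patch, just the pre-patch loop from
the diff below, annotated with where things go wrong):

    /* The recursive drain below may run a block_job_defer_to_main_loop BH
     * that completes a job and drops the very child we are standing on, so
     * the next step of the plain QLIST_FOREACH dereferences freed memory.
     */
    QLIST_FOREACH(child, &bs->children, next) {
        waited |= bdrv_drain_recurse(child->bs);   /* may free 'child' */
    }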

Stash the bs pointer and use bdrv_ref/bdrv_unref in addition to
QLIST_FOREACH_SAFE to prevent such a case from happening.

Since bdrv_unref accesses global state that is not protected by the AioContext
lock, we cannot use bdrv_ref/bdrv_unref unconditionally.  Fortunately the
protection is not needed in an IOThread because only the main loop can
modify the graph with the AioContext lock held.

Signed-off-by: Fam Zheng <famz@redhat.com>
---
 block/io.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/block/io.c b/block/io.c
index 8706bfa..a0df8c4 100644
--- a/block/io.c
+++ b/block/io.c
@@ -158,7 +158,7 @@ bool bdrv_requests_pending(BlockDriverState *bs)
 
 static bool bdrv_drain_recurse(BlockDriverState *bs)
 {
-    BdrvChild *child;
+    BdrvChild *child, *tmp;
     bool waited;
 
     waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);
@@ -167,8 +167,25 @@ static bool bdrv_drain_recurse(BlockDriverState *bs)
         bs->drv->bdrv_drain(bs);
     }
 
-    QLIST_FOREACH(child, &bs->children, next) {
-        waited |= bdrv_drain_recurse(child->bs);
+    QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
+        BlockDriverState *bs = child->bs;
+        bool in_main_loop =
+            qemu_get_current_aio_context() == qemu_get_aio_context();
+        assert(bs->refcnt > 0);
+        if (in_main_loop) {
+            /* In case the resursive bdrv_drain_recurse processes a
+             * block_job_defer_to_main_loop BH and modifies the graph,
+             * let's hold a reference to bs until we are done.
+             *
+             * IOThread doesn't have such a BH, and it is not safe to call
+             * bdrv_unref without BQL, so skip doing it there.
+             **/
+            bdrv_ref(bs);
+        }
+        waited |= bdrv_drain_recurse(bs);
+        if (in_main_loop) {
+            bdrv_unref(bs);
+        }
     }
 
     return waited;
-- 
2.9.3


* [Qemu-devel] [PATCH for-2.9-rc5 v4 2/2] block: Drain BH in bdrv_drained_begin
  2017-04-18 14:30 [Qemu-devel] [PATCH for-2.9-rc5 v4 0/2] block: Drain BH in bdrv_drained_begin Fam Zheng
  2017-04-18 14:30 ` [Qemu-devel] [PATCH for-2.9-rc5 v4 1/2] block: Walk bs->children carefully in bdrv_drain_recurse Fam Zheng
@ 2017-04-18 14:30 ` Fam Zheng
  2018-02-06 15:32   ` Kevin Wolf
  2017-04-18 14:42 ` [Qemu-devel] [PATCH for-2.9-rc5 v4 0/2] " Paolo Bonzini
  2017-04-18 14:51 ` Jeff Cody
  3 siblings, 1 reply; 13+ messages in thread
From: Fam Zheng @ 2017-04-18 14:30 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Fam Zheng, Kevin Wolf, Max Reitz, pbonzini,
	Stefan Hajnoczi, jcody

During block job completion, nothing is preventing
block_job_defer_to_main_loop_bh from being called in a nested
aio_poll(), which causes trouble, such as in this code path:

    qmp_block_commit
      commit_active_start
        bdrv_reopen
          bdrv_reopen_multiple
            bdrv_reopen_prepare
              bdrv_flush
                aio_poll
                  aio_bh_poll
                    aio_bh_call
                      block_job_defer_to_main_loop_bh
                        stream_complete
                          bdrv_reopen

block_job_defer_to_main_loop_bh is the last step of the stream job,
which should have been "paused" by the bdrv_drained_begin/end in
bdrv_reopen_multiple, but it is not done because it's in the form of a
main loop BH.

Similar to why block jobs should be paused between drained_begin and
drained_end, BHs they schedule must be excluded as well.  To achieve
this, this patch forces draining the BH in BDRV_POLL_WHILE.

As a side effect this fixes a hang in block_job_detach_aio_context
during system_reset when a block job is ready:

    #0  0x0000555555aa79f3 in bdrv_drain_recurse
    #1  0x0000555555aa825d in bdrv_drained_begin
    #2  0x0000555555aa8449 in bdrv_drain
    #3  0x0000555555a9c356 in blk_drain
    #4  0x0000555555aa3cfd in mirror_drain
    #5  0x0000555555a66e11 in block_job_detach_aio_context
    #6  0x0000555555a62f4d in bdrv_detach_aio_context
    #7  0x0000555555a63116 in bdrv_set_aio_context
    #8  0x0000555555a9d326 in blk_set_aio_context
    #9  0x00005555557e38da in virtio_blk_data_plane_stop
    #10 0x00005555559f9d5f in virtio_bus_stop_ioeventfd
    #11 0x00005555559fa49b in virtio_bus_stop_ioeventfd
    #12 0x00005555559f6a18 in virtio_pci_stop_ioeventfd
    #13 0x00005555559f6a18 in virtio_pci_reset
    #14 0x00005555559139a9 in qdev_reset_one
    #15 0x0000555555916738 in qbus_walk_children
    #16 0x0000555555913318 in qdev_walk_children
    #17 0x0000555555916738 in qbus_walk_children
    #18 0x00005555559168ca in qemu_devices_reset
    #19 0x000055555581fcbb in pc_machine_reset
    #20 0x00005555558a4d96 in qemu_system_reset
    #21 0x000055555577157a in main_loop_should_exit
    #22 0x000055555577157a in main_loop
    #23 0x000055555577157a in main

The rationale is that the loop in block_job_detach_aio_context cannot
make any progress in pausing/completing the job, because bs->in_flight
is 0, so bdrv_drain doesn't process the block_job_defer_to_main_loop
BH. With this patch, it does.
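
To make the effect concrete, here is the drain call site from patch 1, with
comments that are only illustrative (they are not part of either patch):

    /* Before this change, bs->in_flight == 0 means the loop body never runs
     * and a pending block_job_defer_to_main_loop BH stays queued.  With the
     * new BDRV_POLL_WHILE below, the macro still does non-blocking
     * aio_poll() passes until they report no progress, so that BH gets
     * dispatched while the node is drained.
     */
    waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);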

Reported-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
---
 include/block/block.h | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 97d4330..5ddc0cf 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -381,12 +381,13 @@ void bdrv_drain_all(void);
 
 #define BDRV_POLL_WHILE(bs, cond) ({                       \
     bool waited_ = false;                                  \
+    bool busy_ = true;                                     \
     BlockDriverState *bs_ = (bs);                          \
     AioContext *ctx_ = bdrv_get_aio_context(bs_);          \
     if (aio_context_in_iothread(ctx_)) {                   \
-        while ((cond)) {                                   \
-            aio_poll(ctx_, true);                          \
-            waited_ = true;                                \
+        while ((cond) || busy_) {                          \
+            busy_ = aio_poll(ctx_, (cond));                \
+            waited_ |= !!(cond) | busy_;                   \
         }                                                  \
     } else {                                               \
         assert(qemu_get_current_aio_context() ==           \
@@ -398,11 +399,16 @@ void bdrv_drain_all(void);
          */                                                \
         assert(!bs_->wakeup);                              \
         bs_->wakeup = true;                                \
-        while ((cond)) {                                   \
-            aio_context_release(ctx_);                     \
-            aio_poll(qemu_get_aio_context(), true);        \
-            aio_context_acquire(ctx_);                     \
-            waited_ = true;                                \
+        while (busy_) {                                    \
+            if ((cond)) {                                  \
+                waited_ = busy_ = true;                    \
+                aio_context_release(ctx_);                 \
+                aio_poll(qemu_get_aio_context(), true);    \
+                aio_context_acquire(ctx_);                 \
+            } else {                                       \
+                busy_ = aio_poll(ctx_, false);             \
+                waited_ |= busy_;                          \
+            }                                              \
         }                                                  \
         bs_->wakeup = false;                               \
     }                                                      \
-- 
2.9.3


* Re: [Qemu-devel] [PATCH for-2.9-rc5 v4 0/2] block: Drain BH in bdrv_drained_begin
  2017-04-18 14:30 [Qemu-devel] [PATCH for-2.9-rc5 v4 0/2] block: Drain BH in bdrv_drained_begin Fam Zheng
  2017-04-18 14:30 ` [Qemu-devel] [PATCH for-2.9-rc5 v4 1/2] block: Walk bs->children carefully in bdrv_drain_recurse Fam Zheng
  2017-04-18 14:30 ` [Qemu-devel] [PATCH for-2.9-rc5 v4 2/2] block: Drain BH in bdrv_drained_begin Fam Zheng
@ 2017-04-18 14:42 ` Paolo Bonzini
  2017-04-18 14:51 ` Jeff Cody
  3 siblings, 0 replies; 13+ messages in thread
From: Paolo Bonzini @ 2017-04-18 14:42 UTC (permalink / raw)
  To: Fam Zheng, qemu-devel
  Cc: qemu-block, Kevin Wolf, Max Reitz, Stefan Hajnoczi, jcody



On 18/04/2017 16:30, Fam Zheng wrote:
> v4: Split patch, and fix the unsafe bdrv_unref. [Paolo]
> 
> Fam Zheng (2):
>   block: Walk bs->children carefully in bdrv_drain_recurse
>   block: Drain BH in bdrv_drained_begin
> 
>  block/io.c            | 23 ++++++++++++++++++++---
>  include/block/block.h | 22 ++++++++++++++--------
>  2 files changed, 34 insertions(+), 11 deletions(-)
> 

Thanks, this looks good.

Paolo


* Re: [Qemu-devel] [PATCH for-2.9-rc5 v4 1/2] block: Walk bs->children carefully in bdrv_drain_recurse
  2017-04-18 14:30 ` [Qemu-devel] [PATCH for-2.9-rc5 v4 1/2] block: Walk bs->children carefully in bdrv_drain_recurse Fam Zheng
@ 2017-04-18 14:46   ` Kevin Wolf
  2017-04-18 14:54     ` Fam Zheng
  0 siblings, 1 reply; 13+ messages in thread
From: Kevin Wolf @ 2017-04-18 14:46 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-devel, qemu-block, Max Reitz, pbonzini, Stefan Hajnoczi,
	jcody

Am 18.04.2017 um 16:30 hat Fam Zheng geschrieben:
> The recursive bdrv_drain_recurse may run a block job completion BH that
> drops nodes. The coming changes will make that more likely, and a
> use-after-free would happen without this patch.
> 
> Stash the bs pointer and use bdrv_ref/bdrv_unref in addition to
> QLIST_FOREACH_SAFE to prevent such a case from happening.
> 
> Since bdrv_unref accesses global state that is not protected by the AioContext
> lock, we cannot use bdrv_ref/bdrv_unref unconditionally.  Fortunately the
> protection is not needed in an IOThread because only the main loop can
> modify the graph with the AioContext lock held.
> 
> Signed-off-by: Fam Zheng <famz@redhat.com>
> ---
>  block/io.c | 23 ++++++++++++++++++++---
>  1 file changed, 20 insertions(+), 3 deletions(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 8706bfa..a0df8c4 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -158,7 +158,7 @@ bool bdrv_requests_pending(BlockDriverState *bs)
>  
>  static bool bdrv_drain_recurse(BlockDriverState *bs)
>  {
> -    BdrvChild *child;
> +    BdrvChild *child, *tmp;
>      bool waited;
>  
>      waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);
> @@ -167,8 +167,25 @@ static bool bdrv_drain_recurse(BlockDriverState *bs)
>          bs->drv->bdrv_drain(bs);
>      }
>  
> -    QLIST_FOREACH(child, &bs->children, next) {
> -        waited |= bdrv_drain_recurse(child->bs);
> +    QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
> +        BlockDriverState *bs = child->bs;
> +        bool in_main_loop =
> +            qemu_get_current_aio_context() == qemu_get_aio_context();
> +        assert(bs->refcnt > 0);
> +        if (in_main_loop) {
> +            /* In case the resursive bdrv_drain_recurse processes a

s/resursive/recursive/

> +             * block_job_defer_to_main_loop BH and modifies the graph,
> +             * let's hold a reference to bs until we are done.
> +             *
> +             * IOThread doesn't have such a BH, and it is not safe to call
> +             * bdrv_unref without BQL, so skip doing it there.
> +             **/

And **/ is unusual, too.

> +            bdrv_ref(bs);
> +        }
> +        waited |= bdrv_drain_recurse(bs);
> +        if (in_main_loop) {
> +            bdrv_unref(bs);
> +        }
>      }

Other than this, the series looks good to me.

Kevin


* Re: [Qemu-devel] [PATCH for-2.9-rc5 v4 0/2] block: Drain BH in bdrv_drained_begin
  2017-04-18 14:30 [Qemu-devel] [PATCH for-2.9-rc5 v4 0/2] block: Drain BH in bdrv_drained_begin Fam Zheng
                   ` (2 preceding siblings ...)
  2017-04-18 14:42 ` [Qemu-devel] [PATCH for-2.9-rc5 v4 0/2] " Paolo Bonzini
@ 2017-04-18 14:51 ` Jeff Cody
  3 siblings, 0 replies; 13+ messages in thread
From: Jeff Cody @ 2017-04-18 14:51 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-devel, qemu-block, Kevin Wolf, Max Reitz, pbonzini,
	Stefan Hajnoczi

On Tue, Apr 18, 2017 at 10:30:42PM +0800, Fam Zheng wrote:
> v4: Split patch, and fix the unsafe bdrv_unref. [Paolo]
> 
> Fam Zheng (2):
>   block: Walk bs->children carefully in bdrv_drain_recurse
>   block: Drain BH in bdrv_drained_begin
> 
>  block/io.c            | 23 ++++++++++++++++++++---
>  include/block/block.h | 22 ++++++++++++++--------
>  2 files changed, 34 insertions(+), 11 deletions(-)
> 
> -- 
> 2.9.3
>

Reviewed-by: Jeff Cody <jcody@redhat.com>

(also Tested-by)


* Re: [Qemu-devel] [PATCH for-2.9-rc5 v4 1/2] block: Walk bs->children carefully in bdrv_drain_recurse
  2017-04-18 14:46   ` Kevin Wolf
@ 2017-04-18 14:54     ` Fam Zheng
  0 siblings, 0 replies; 13+ messages in thread
From: Fam Zheng @ 2017-04-18 14:54 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-devel, qemu-block, Max Reitz, pbonzini, Stefan Hajnoczi,
	jcody

On Tue, 04/18 16:46, Kevin Wolf wrote:
> Am 18.04.2017 um 16:30 hat Fam Zheng geschrieben:
> > The recursive bdrv_drain_recurse may run a block job completion BH that
> > drops nodes. The coming changes will make that more likely, and a
> > use-after-free would happen without this patch.
> > 
> > Stash the bs pointer and use bdrv_ref/bdrv_unref in addition to
> > QLIST_FOREACH_SAFE to prevent such a case from happening.
> > 
> > Since bdrv_unref accesses global state that is not protected by the AioContext
> > lock, we cannot use bdrv_ref/bdrv_unref unconditionally.  Fortunately the
> > protection is not needed in an IOThread because only the main loop can
> > modify the graph with the AioContext lock held.
> > 
> > Signed-off-by: Fam Zheng <famz@redhat.com>
> > ---
> >  block/io.c | 23 ++++++++++++++++++++---
> >  1 file changed, 20 insertions(+), 3 deletions(-)
> > 
> > diff --git a/block/io.c b/block/io.c
> > index 8706bfa..a0df8c4 100644
> > --- a/block/io.c
> > +++ b/block/io.c
> > @@ -158,7 +158,7 @@ bool bdrv_requests_pending(BlockDriverState *bs)
> >  
> >  static bool bdrv_drain_recurse(BlockDriverState *bs)
> >  {
> > -    BdrvChild *child;
> > +    BdrvChild *child, *tmp;
> >      bool waited;
> >  
> >      waited = BDRV_POLL_WHILE(bs, atomic_read(&bs->in_flight) > 0);
> > @@ -167,8 +167,25 @@ static bool bdrv_drain_recurse(BlockDriverState *bs)
> >          bs->drv->bdrv_drain(bs);
> >      }
> >  
> > -    QLIST_FOREACH(child, &bs->children, next) {
> > -        waited |= bdrv_drain_recurse(child->bs);
> > +    QLIST_FOREACH_SAFE(child, &bs->children, next, tmp) {
> > +        BlockDriverState *bs = child->bs;
> > +        bool in_main_loop =
> > +            qemu_get_current_aio_context() == qemu_get_aio_context();
> > +        assert(bs->refcnt > 0);
> > +        if (in_main_loop) {
> > +            /* In case the resursive bdrv_drain_recurse processes a
> 
> s/resursive/recursive/
> 
> > +             * block_job_defer_to_main_loop BH and modifies the graph,
> > +             * let's hold a reference to bs until we are done.
> > +             *
> > +             * IOThread doesn't have such a BH, and it is not safe to call
> > +             * bdrv_unref without BQL, so skip doing it there.
> > +             **/
> 
> And **/ is unusual, too.
> 
> > +            bdrv_ref(bs);
> > +        }
> > +        waited |= bdrv_drain_recurse(bs);
> > +        if (in_main_loop) {
> > +            bdrv_unref(bs);
> > +        }
> >      }
> 
> Other than this, the series looks good to me.

Thanks, I'll fix them and send a pull request to Peter.

Fam


* Re: [Qemu-devel] [PATCH for-2.9-rc5 v4 2/2] block: Drain BH in bdrv_drained_begin
  2017-04-18 14:30 ` [Qemu-devel] [PATCH for-2.9-rc5 v4 2/2] block: Drain BH in bdrv_drained_begin Fam Zheng
@ 2018-02-06 15:32   ` Kevin Wolf
  2018-02-07  1:48     ` Fam Zheng
  0 siblings, 1 reply; 13+ messages in thread
From: Kevin Wolf @ 2018-02-06 15:32 UTC (permalink / raw)
  To: Fam Zheng
  Cc: qemu-devel, qemu-block, Max Reitz, pbonzini, Stefan Hajnoczi,
	jcody

Am 18.04.2017 um 16:30 hat Fam Zheng geschrieben:
> During block job completion, nothing is preventing
> block_job_defer_to_main_loop_bh from being called in a nested
> aio_poll(), which causes trouble, such as in this code path:
> 
>     qmp_block_commit
>       commit_active_start
>         bdrv_reopen
>           bdrv_reopen_multiple
>             bdrv_reopen_prepare
>               bdrv_flush
>                 aio_poll
>                   aio_bh_poll
>                     aio_bh_call
>                       block_job_defer_to_main_loop_bh
>                         stream_complete
>                           bdrv_reopen
> 
> block_job_defer_to_main_loop_bh is the last step of the stream job,
> which should have been "paused" by the bdrv_drained_begin/end in
> bdrv_reopen_multiple, but it is not done because it's in the form of a
> main loop BH.
> 
> Similar to why block jobs should be paused between drained_begin and
> drained_end, BHs they schedule must be excluded as well.  To achieve
> this, this patch forces draining the BH in BDRV_POLL_WHILE.
> 
> As a side effect this fixes a hang in block_job_detach_aio_context
> during system_reset when a block job is ready:
> 
>     #0  0x0000555555aa79f3 in bdrv_drain_recurse
>     #1  0x0000555555aa825d in bdrv_drained_begin
>     #2  0x0000555555aa8449 in bdrv_drain
>     #3  0x0000555555a9c356 in blk_drain
>     #4  0x0000555555aa3cfd in mirror_drain
>     #5  0x0000555555a66e11 in block_job_detach_aio_context
>     #6  0x0000555555a62f4d in bdrv_detach_aio_context
>     #7  0x0000555555a63116 in bdrv_set_aio_context
>     #8  0x0000555555a9d326 in blk_set_aio_context
>     #9  0x00005555557e38da in virtio_blk_data_plane_stop
>     #10 0x00005555559f9d5f in virtio_bus_stop_ioeventfd
>     #11 0x00005555559fa49b in virtio_bus_stop_ioeventfd
>     #12 0x00005555559f6a18 in virtio_pci_stop_ioeventfd
>     #13 0x00005555559f6a18 in virtio_pci_reset
>     #14 0x00005555559139a9 in qdev_reset_one
>     #15 0x0000555555916738 in qbus_walk_children
>     #16 0x0000555555913318 in qdev_walk_children
>     #17 0x0000555555916738 in qbus_walk_children
>     #18 0x00005555559168ca in qemu_devices_reset
>     #19 0x000055555581fcbb in pc_machine_reset
>     #20 0x00005555558a4d96 in qemu_system_reset
>     #21 0x000055555577157a in main_loop_should_exit
>     #22 0x000055555577157a in main_loop
>     #23 0x000055555577157a in main
> 
> The rationale is that the loop in block_job_detach_aio_context cannot
> make any progress in pausing/completing the job, because bs->in_flight
> is 0, so bdrv_drain doesn't process the block_job_defer_to_main_loop
> BH. With this patch, it does.
> 
> Reported-by: Jeff Cody <jcody@redhat.com>
> Signed-off-by: Fam Zheng <famz@redhat.com>

Fam, do you remember whether this was really only about drain? Because
in that case...

> diff --git a/include/block/block.h b/include/block/block.h
> index 97d4330..5ddc0cf 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -381,12 +381,13 @@ void bdrv_drain_all(void);
>  
>  #define BDRV_POLL_WHILE(bs, cond) ({                       \
>      bool waited_ = false;                                  \
> +    bool busy_ = true;                                     \
>      BlockDriverState *bs_ = (bs);                          \
>      AioContext *ctx_ = bdrv_get_aio_context(bs_);          \
>      if (aio_context_in_iothread(ctx_)) {                   \
> -        while ((cond)) {                                   \
> -            aio_poll(ctx_, true);                          \
> -            waited_ = true;                                \
> +        while ((cond) || busy_) {                          \
> +            busy_ = aio_poll(ctx_, (cond));                \
> +            waited_ |= !!(cond) | busy_;                   \
>          }                                                  \
>      } else {                                               \
>          assert(qemu_get_current_aio_context() ==           \
> @@ -398,11 +399,16 @@ void bdrv_drain_all(void);
>           */                                                \
>          assert(!bs_->wakeup);                              \
>          bs_->wakeup = true;                                \
> -        while ((cond)) {                                   \
> -            aio_context_release(ctx_);                     \
> -            aio_poll(qemu_get_aio_context(), true);        \
> -            aio_context_acquire(ctx_);                     \
> -            waited_ = true;                                \
> +        while (busy_) {                                    \
> +            if ((cond)) {                                  \
> +                waited_ = busy_ = true;                    \
> +                aio_context_release(ctx_);                 \
> +                aio_poll(qemu_get_aio_context(), true);    \
> +                aio_context_acquire(ctx_);                 \
> +            } else {                                       \
> +                busy_ = aio_poll(ctx_, false);             \
> +                waited_ |= busy_;                          \
> +            }                                              \
>          }                                                  \
>          bs_->wakeup = false;                               \
>      }                                                      \

...fixing it in BDRV_POLL_WHILE() rather than the drain functions may
have been a bad idea.

I got a bug assigned where we have a large (200 GB) fragmented qcow2
image, and qemu-img convert takes two hours before it even starts to
copy data. What it does in this time is iterating through the whole
image with bdrv_block_status_above() in order to estimate the work to be
done (and of course it will repeat the same calls while actually copying
data).

One of the things I saw is that between each pair of lseek() calls, we
also have unnecessary poll() calls, and these were introduced by this
patch. If I modify bdrv_common_block_status_above() so that it doesn't
poll if data.done == true already, the time needed to iterate through my
test image is cut in half.
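
A minimal sketch of that modification, assuming the coroutine + BDRV_POLL_WHILE
shape that bdrv_common_block_status_above() had at the time (the surrounding
variable names are approximations; this is not a tested patch):

    bdrv_coroutine_enter(bs, co);
    if (!data.done) {
        /* Only poll when the coroutine actually had to yield; for fast
         * (cached/local) lseek() results data.done is already true here
         * and the extra aio_poll() round trips are skipped entirely. */
        BDRV_POLL_WHILE(bs, !data.done);
    }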

Now, of course, I'm still only seeing a few seconds for a 100 GB image,
so there must be more that is wrong for the reporter, but it suggests to
me that BDRV_POLL_WHILE() shouldn't be polling unconditionally when only
one of its users actually needs this.

Kevin


* Re: [Qemu-devel] [PATCH for-2.9-rc5 v4 2/2] block: Drain BH in bdrv_drained_begin
  2018-02-06 15:32   ` Kevin Wolf
@ 2018-02-07  1:48     ` Fam Zheng
  2018-02-07 10:51       ` Kevin Wolf
  0 siblings, 1 reply; 13+ messages in thread
From: Fam Zheng @ 2018-02-07  1:48 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: QEMU Developers, qemu-block, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi, Jeffrey Cody

On Tue, Feb 6, 2018 at 11:32 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 18.04.2017 um 16:30 hat Fam Zheng geschrieben:
>> During block job completion, nothing is preventing
>> block_job_defer_to_main_loop_bh from being called in a nested
>> aio_poll(), which causes trouble, such as in this code path:
>>
>>     qmp_block_commit
>>       commit_active_start
>>         bdrv_reopen
>>           bdrv_reopen_multiple
>>             bdrv_reopen_prepare
>>               bdrv_flush
>>                 aio_poll
>>                   aio_bh_poll
>>                     aio_bh_call
>>                       block_job_defer_to_main_loop_bh
>>                         stream_complete
>>                           bdrv_reopen
>>
>> block_job_defer_to_main_loop_bh is the last step of the stream job,
>> which should have been "paused" by the bdrv_drained_begin/end in
>> bdrv_reopen_multiple, but it is not done because it's in the form of a
>> main loop BH.
>>
>> Similar to why block jobs should be paused between drained_begin and
>> drained_end, BHs they schedule must be excluded as well.  To achieve
>> this, this patch forces draining the BH in BDRV_POLL_WHILE.
>>
>> As a side effect this fixes a hang in block_job_detach_aio_context
>> during system_reset when a block job is ready:
>>
>>     #0  0x0000555555aa79f3 in bdrv_drain_recurse
>>     #1  0x0000555555aa825d in bdrv_drained_begin
>>     #2  0x0000555555aa8449 in bdrv_drain
>>     #3  0x0000555555a9c356 in blk_drain
>>     #4  0x0000555555aa3cfd in mirror_drain
>>     #5  0x0000555555a66e11 in block_job_detach_aio_context
>>     #6  0x0000555555a62f4d in bdrv_detach_aio_context
>>     #7  0x0000555555a63116 in bdrv_set_aio_context
>>     #8  0x0000555555a9d326 in blk_set_aio_context
>>     #9  0x00005555557e38da in virtio_blk_data_plane_stop
>>     #10 0x00005555559f9d5f in virtio_bus_stop_ioeventfd
>>     #11 0x00005555559fa49b in virtio_bus_stop_ioeventfd
>>     #12 0x00005555559f6a18 in virtio_pci_stop_ioeventfd
>>     #13 0x00005555559f6a18 in virtio_pci_reset
>>     #14 0x00005555559139a9 in qdev_reset_one
>>     #15 0x0000555555916738 in qbus_walk_children
>>     #16 0x0000555555913318 in qdev_walk_children
>>     #17 0x0000555555916738 in qbus_walk_children
>>     #18 0x00005555559168ca in qemu_devices_reset
>>     #19 0x000055555581fcbb in pc_machine_reset
>>     #20 0x00005555558a4d96 in qemu_system_reset
>>     #21 0x000055555577157a in main_loop_should_exit
>>     #22 0x000055555577157a in main_loop
>>     #23 0x000055555577157a in main
>>
>> The rationale is that the loop in block_job_detach_aio_context cannot
>> make any progress in pausing/completing the job, because bs->in_flight
>> is 0, so bdrv_drain doesn't process the block_job_defer_to_main_loop
>> BH. With this patch, it does.
>>
>> Reported-by: Jeff Cody <jcody@redhat.com>
>> Signed-off-by: Fam Zheng <famz@redhat.com>
>
> Fam, do you remember whether this was really only about drain? Because
> in that case...

Yes I believe so.

>
>> diff --git a/include/block/block.h b/include/block/block.h
>> index 97d4330..5ddc0cf 100644
>> --- a/include/block/block.h
>> +++ b/include/block/block.h
>> @@ -381,12 +381,13 @@ void bdrv_drain_all(void);
>>
>>  #define BDRV_POLL_WHILE(bs, cond) ({                       \
>>      bool waited_ = false;                                  \
>> +    bool busy_ = true;                                     \
>>      BlockDriverState *bs_ = (bs);                          \
>>      AioContext *ctx_ = bdrv_get_aio_context(bs_);          \
>>      if (aio_context_in_iothread(ctx_)) {                   \
>> -        while ((cond)) {                                   \
>> -            aio_poll(ctx_, true);                          \
>> -            waited_ = true;                                \
>> +        while ((cond) || busy_) {                          \
>> +            busy_ = aio_poll(ctx_, (cond));                \
>> +            waited_ |= !!(cond) | busy_;                   \
>>          }                                                  \
>>      } else {                                               \
>>          assert(qemu_get_current_aio_context() ==           \
>> @@ -398,11 +399,16 @@ void bdrv_drain_all(void);
>>           */                                                \
>>          assert(!bs_->wakeup);                              \
>>          bs_->wakeup = true;                                \
>> -        while ((cond)) {                                   \
>> -            aio_context_release(ctx_);                     \
>> -            aio_poll(qemu_get_aio_context(), true);        \
>> -            aio_context_acquire(ctx_);                     \
>> -            waited_ = true;                                \
>> +        while (busy_) {                                    \
>> +            if ((cond)) {                                  \
>> +                waited_ = busy_ = true;                    \
>> +                aio_context_release(ctx_);                 \
>> +                aio_poll(qemu_get_aio_context(), true);    \
>> +                aio_context_acquire(ctx_);                 \
>> +            } else {                                       \
>> +                busy_ = aio_poll(ctx_, false);             \
>> +                waited_ |= busy_;                          \
>> +            }                                              \
>>          }                                                  \
>>          bs_->wakeup = false;                               \
>>      }                                                      \
>
> ...fixing it in BDRV_POLL_WHILE() rather than the drain functions may
> have been a bad idea.
>
> I got a bug assigned where we have a large (200 GB) fragmented qcow2
> image, and qemu-img convert takes two hours before it even starts to
> copy data. What it does in this time is iterating through the whole
> image with bdrv_block_status_above() in order to estimate the work to be
> done (and of course it will repeat the same calls while actually copying
> data).

The convert_iteration_sectors loop looks wasteful. Why cannot we
report progress simply based on (offset/size) * 100% so we don't need
to do this estimation?

>
> One of the things I saw is that between each pair of lseek() calls, we
> also have unnecessary poll() calls, and these were introduced by this
> patch. If I modify bdrv_common_block_status_above() so that it doesn't
> poll if data.done == true already, the time needed to iterate through my
> test image is cut in half.
>
> Now, of course, I'm still only seeing a few seconds for a 100 GB image,
> so there must be more that is wrong for the reporter, but it suggests to
> me that BDRV_POLL_WHILE() shouldn't be polling unconditionally when only
> one of its users actually needs this.

Sounds fine to me. Maybe we could add a boolean parameter to BDRV_POLL_WHILE?

Fam


* Re: [Qemu-devel] [PATCH for-2.9-rc5 v4 2/2] block: Drain BH in bdrv_drained_begin
  2018-02-07  1:48     ` Fam Zheng
@ 2018-02-07 10:51       ` Kevin Wolf
  2018-02-07 12:39         ` Fam Zheng
  0 siblings, 1 reply; 13+ messages in thread
From: Kevin Wolf @ 2018-02-07 10:51 UTC (permalink / raw)
  To: Fam Zheng
  Cc: QEMU Developers, qemu-block, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi, Jeffrey Cody

Am 07.02.2018 um 02:48 hat Fam Zheng geschrieben:
> On Tue, Feb 6, 2018 at 11:32 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> > Am 18.04.2017 um 16:30 hat Fam Zheng geschrieben:
> >> During block job completion, nothing is preventing
> >> block_job_defer_to_main_loop_bh from being called in a nested
> >> aio_poll(), which causes trouble, such as in this code path:
> >>
> >>     qmp_block_commit
> >>       commit_active_start
> >>         bdrv_reopen
> >>           bdrv_reopen_multiple
> >>             bdrv_reopen_prepare
> >>               bdrv_flush
> >>                 aio_poll
> >>                   aio_bh_poll
> >>                     aio_bh_call
> >>                       block_job_defer_to_main_loop_bh
> >>                         stream_complete
> >>                           bdrv_reopen
> >>
> >> block_job_defer_to_main_loop_bh is the last step of the stream job,
> >> which should have been "paused" by the bdrv_drained_begin/end in
> >> bdrv_reopen_multiple, but it is not done because it's in the form of a
> >> main loop BH.
> >>
> >> Similar to why block jobs should be paused between drained_begin and
> >> drained_end, BHs they schedule must be excluded as well.  To achieve
> >> this, this patch forces draining the BH in BDRV_POLL_WHILE.
> >>
> >> As a side effect this fixes a hang in block_job_detach_aio_context
> >> during system_reset when a block job is ready:
> >>
> >>     #0  0x0000555555aa79f3 in bdrv_drain_recurse
> >>     #1  0x0000555555aa825d in bdrv_drained_begin
> >>     #2  0x0000555555aa8449 in bdrv_drain
> >>     #3  0x0000555555a9c356 in blk_drain
> >>     #4  0x0000555555aa3cfd in mirror_drain
> >>     #5  0x0000555555a66e11 in block_job_detach_aio_context
> >>     #6  0x0000555555a62f4d in bdrv_detach_aio_context
> >>     #7  0x0000555555a63116 in bdrv_set_aio_context
> >>     #8  0x0000555555a9d326 in blk_set_aio_context
> >>     #9  0x00005555557e38da in virtio_blk_data_plane_stop
> >>     #10 0x00005555559f9d5f in virtio_bus_stop_ioeventfd
> >>     #11 0x00005555559fa49b in virtio_bus_stop_ioeventfd
> >>     #12 0x00005555559f6a18 in virtio_pci_stop_ioeventfd
> >>     #13 0x00005555559f6a18 in virtio_pci_reset
> >>     #14 0x00005555559139a9 in qdev_reset_one
> >>     #15 0x0000555555916738 in qbus_walk_children
> >>     #16 0x0000555555913318 in qdev_walk_children
> >>     #17 0x0000555555916738 in qbus_walk_children
> >>     #18 0x00005555559168ca in qemu_devices_reset
> >>     #19 0x000055555581fcbb in pc_machine_reset
> >>     #20 0x00005555558a4d96 in qemu_system_reset
> >>     #21 0x000055555577157a in main_loop_should_exit
> >>     #22 0x000055555577157a in main_loop
> >>     #23 0x000055555577157a in main
> >>
> >> The rationale is that the loop in block_job_detach_aio_context cannot
> >> make any progress in pausing/completing the job, because bs->in_flight
> >> is 0, so bdrv_drain doesn't process the block_job_defer_to_main_loop
> >> BH. With this patch, it does.
> >>
> >> Reported-by: Jeff Cody <jcody@redhat.com>
> >> Signed-off-by: Fam Zheng <famz@redhat.com>
> >
> > Fam, do you remember whether this was really only about drain? Because
> > in that case...
> 
> Yes I believe so.
> 
> >
> >> diff --git a/include/block/block.h b/include/block/block.h
> >> index 97d4330..5ddc0cf 100644
> >> --- a/include/block/block.h
> >> +++ b/include/block/block.h
> >> @@ -381,12 +381,13 @@ void bdrv_drain_all(void);
> >>
> >>  #define BDRV_POLL_WHILE(bs, cond) ({                       \
> >>      bool waited_ = false;                                  \
> >> +    bool busy_ = true;                                     \
> >>      BlockDriverState *bs_ = (bs);                          \
> >>      AioContext *ctx_ = bdrv_get_aio_context(bs_);          \
> >>      if (aio_context_in_iothread(ctx_)) {                   \
> >> -        while ((cond)) {                                   \
> >> -            aio_poll(ctx_, true);                          \
> >> -            waited_ = true;                                \
> >> +        while ((cond) || busy_) {                          \
> >> +            busy_ = aio_poll(ctx_, (cond));                \
> >> +            waited_ |= !!(cond) | busy_;                   \
> >>          }                                                  \
> >>      } else {                                               \
> >>          assert(qemu_get_current_aio_context() ==           \
> >> @@ -398,11 +399,16 @@ void bdrv_drain_all(void);
> >>           */                                                \
> >>          assert(!bs_->wakeup);                              \
> >>          bs_->wakeup = true;                                \
> >> -        while ((cond)) {                                   \
> >> -            aio_context_release(ctx_);                     \
> >> -            aio_poll(qemu_get_aio_context(), true);        \
> >> -            aio_context_acquire(ctx_);                     \
> >> -            waited_ = true;                                \
> >> +        while (busy_) {                                    \
> >> +            if ((cond)) {                                  \
> >> +                waited_ = busy_ = true;                    \
> >> +                aio_context_release(ctx_);                 \
> >> +                aio_poll(qemu_get_aio_context(), true);    \
> >> +                aio_context_acquire(ctx_);                 \
> >> +            } else {                                       \
> >> +                busy_ = aio_poll(ctx_, false);             \
> >> +                waited_ |= busy_;                          \
> >> +            }                                              \
> >>          }                                                  \
> >>          bs_->wakeup = false;                               \
> >>      }                                                      \
> >
> > ...fixing it in BDRV_POLL_WHILE() rather than the drain functions may
> > have been a bad idea.
> >
> > I got a bug assigned where we have a large (200 GB) fragmented qcow2
> > image, and qemu-img convert takes two hours before it even starts to
> > copy data. What it does in this time is iterating through the whole
> > image with bdrv_block_status_above() in order to estimate the work to be
> > done (and of course it will repeat the same calls while actually copying
> > data).
> 
> The convert_iteration_sectors loop looks wasteful. Why cannot we
> report progress simply based on (offset/size) * 100% so we don't need
> to do this estimation?

The assumption is that it's quick and it makes the progress much more
meaningful. You know those progress bars that slowly crawl towards 50%
and then suddenly complete within a second. Or the first 20% are quick,
but then things get really slow. They are not a great example.

There must have been something wrong with that image file that was
reported, because they reported that it was the only image causing
trouble and if they copied it away, it became quick, too.

Even for a maximally fragmented 100 GB qcow2 image it only takes about a
second on my laptop. So while I don't feel as certain about the loop as
before, in practice it normally doesn't seem to hurt.

> >
> > One of the things I saw is that between each pair of lseek() calls, we
> > also have unnecessary poll() calls, and these were introduced by this
> > patch. If I modify bdrv_common_block_status_above() so that it doesn't
> > poll if data.done == true already, the time needed to iterate through my
> > test image is cut in half.
> >
> > Now, of course, I'm still only seeing a few seconds for a 100 GB image,
> > so there must be more that is wrong for the reporter, but it suggests to
> > me that BDRV_POLL_WHILE() shouldn't be polling unconditionally when only
> > one of its users actually needs this.
> 
> Sounds fine to me. Maybe we could add a boolean parameter to
> BDRV_POLL_WHILE?

Why not call aio_poll() explicitly in the BDRV_POLL_WHILE() condition in
the one place that needs it?
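
As a hypothetical sketch (the exact condition shape is an assumption, not an
actual patch), the drain call site could carry the extra polling itself:

    /* aio_poll(ctx, false) returns true whenever it made progress (e.g. it
     * ran a BH), so the loop keeps going until no requests are in flight
     * and no further BH was dispatched.
     */
    waited = BDRV_POLL_WHILE(bs,
                             atomic_read(&bs->in_flight) > 0 ||
                             aio_poll(bdrv_get_aio_context(bs), false));

while BDRV_POLL_WHILE() itself goes back to being a plain "poll while the
condition holds" loop.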

Kevin


* Re: [Qemu-devel] [PATCH for-2.9-rc5 v4 2/2] block: Drain BH in bdrv_drained_begin
  2018-02-07 10:51       ` Kevin Wolf
@ 2018-02-07 12:39         ` Fam Zheng
  2018-02-07 13:10           ` Kevin Wolf
  0 siblings, 1 reply; 13+ messages in thread
From: Fam Zheng @ 2018-02-07 12:39 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: QEMU Developers, qemu-block, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi, Jeffrey Cody

On Wed, Feb 7, 2018 at 6:51 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 07.02.2018 um 02:48 hat Fam Zheng geschrieben:
>> On Tue, Feb 6, 2018 at 11:32 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> > Am 18.04.2017 um 16:30 hat Fam Zheng geschrieben:
>> >> During block job completion, nothing is preventing
>> >> block_job_defer_to_main_loop_bh from being called in a nested
>> >> aio_poll(), which causes trouble, such as in this code path:
>> >>
>> >>     qmp_block_commit
>> >>       commit_active_start
>> >>         bdrv_reopen
>> >>           bdrv_reopen_multiple
>> >>             bdrv_reopen_prepare
>> >>               bdrv_flush
>> >>                 aio_poll
>> >>                   aio_bh_poll
>> >>                     aio_bh_call
>> >>                       block_job_defer_to_main_loop_bh
>> >>                         stream_complete
>> >>                           bdrv_reopen
>> >>
>> >> block_job_defer_to_main_loop_bh is the last step of the stream job,
>> >> which should have been "paused" by the bdrv_drained_begin/end in
>> >> bdrv_reopen_multiple, but it is not done because it's in the form of a
>> >> main loop BH.
>> >>
>> >> Similar to why block jobs should be paused between drained_begin and
>> >> drained_end, BHs they schedule must be excluded as well.  To achieve
>> >> this, this patch forces draining the BH in BDRV_POLL_WHILE.
>> >>
>> >> As a side effect this fixes a hang in block_job_detach_aio_context
>> >> during system_reset when a block job is ready:
>> >>
>> >>     #0  0x0000555555aa79f3 in bdrv_drain_recurse
>> >>     #1  0x0000555555aa825d in bdrv_drained_begin
>> >>     #2  0x0000555555aa8449 in bdrv_drain
>> >>     #3  0x0000555555a9c356 in blk_drain
>> >>     #4  0x0000555555aa3cfd in mirror_drain
>> >>     #5  0x0000555555a66e11 in block_job_detach_aio_context
>> >>     #6  0x0000555555a62f4d in bdrv_detach_aio_context
>> >>     #7  0x0000555555a63116 in bdrv_set_aio_context
>> >>     #8  0x0000555555a9d326 in blk_set_aio_context
>> >>     #9  0x00005555557e38da in virtio_blk_data_plane_stop
>> >>     #10 0x00005555559f9d5f in virtio_bus_stop_ioeventfd
>> >>     #11 0x00005555559fa49b in virtio_bus_stop_ioeventfd
>> >>     #12 0x00005555559f6a18 in virtio_pci_stop_ioeventfd
>> >>     #13 0x00005555559f6a18 in virtio_pci_reset
>> >>     #14 0x00005555559139a9 in qdev_reset_one
>> >>     #15 0x0000555555916738 in qbus_walk_children
>> >>     #16 0x0000555555913318 in qdev_walk_children
>> >>     #17 0x0000555555916738 in qbus_walk_children
>> >>     #18 0x00005555559168ca in qemu_devices_reset
>> >>     #19 0x000055555581fcbb in pc_machine_reset
>> >>     #20 0x00005555558a4d96 in qemu_system_reset
>> >>     #21 0x000055555577157a in main_loop_should_exit
>> >>     #22 0x000055555577157a in main_loop
>> >>     #23 0x000055555577157a in main
>> >>
>> >> The rationale is that the loop in block_job_detach_aio_context cannot
>> >> make any progress in pausing/completing the job, because bs->in_flight
>> >> is 0, so bdrv_drain doesn't process the block_job_defer_to_main_loop
>> >> BH. With this patch, it does.
>> >>
>> >> Reported-by: Jeff Cody <jcody@redhat.com>
>> >> Signed-off-by: Fam Zheng <famz@redhat.com>
>> >
>> > Fam, do you remember whether this was really only about drain? Because
>> > in that case...
>>
>> Yes I believe so.
>>
>> >
>> >> diff --git a/include/block/block.h b/include/block/block.h
>> >> index 97d4330..5ddc0cf 100644
>> >> --- a/include/block/block.h
>> >> +++ b/include/block/block.h
>> >> @@ -381,12 +381,13 @@ void bdrv_drain_all(void);
>> >>
>> >>  #define BDRV_POLL_WHILE(bs, cond) ({                       \
>> >>      bool waited_ = false;                                  \
>> >> +    bool busy_ = true;                                     \
>> >>      BlockDriverState *bs_ = (bs);                          \
>> >>      AioContext *ctx_ = bdrv_get_aio_context(bs_);          \
>> >>      if (aio_context_in_iothread(ctx_)) {                   \
>> >> -        while ((cond)) {                                   \
>> >> -            aio_poll(ctx_, true);                          \
>> >> -            waited_ = true;                                \
>> >> +        while ((cond) || busy_) {                          \
>> >> +            busy_ = aio_poll(ctx_, (cond));                \
>> >> +            waited_ |= !!(cond) | busy_;                   \
>> >>          }                                                  \
>> >>      } else {                                               \
>> >>          assert(qemu_get_current_aio_context() ==           \
>> >> @@ -398,11 +399,16 @@ void bdrv_drain_all(void);
>> >>           */                                                \
>> >>          assert(!bs_->wakeup);                              \
>> >>          bs_->wakeup = true;                                \
>> >> -        while ((cond)) {                                   \
>> >> -            aio_context_release(ctx_);                     \
>> >> -            aio_poll(qemu_get_aio_context(), true);        \
>> >> -            aio_context_acquire(ctx_);                     \
>> >> -            waited_ = true;                                \
>> >> +        while (busy_) {                                    \
>> >> +            if ((cond)) {                                  \
>> >> +                waited_ = busy_ = true;                    \
>> >> +                aio_context_release(ctx_);                 \
>> >> +                aio_poll(qemu_get_aio_context(), true);    \
>> >> +                aio_context_acquire(ctx_);                 \
>> >> +            } else {                                       \
>> >> +                busy_ = aio_poll(ctx_, false);             \
>> >> +                waited_ |= busy_;                          \
>> >> +            }                                              \
>> >>          }                                                  \
>> >>          bs_->wakeup = false;                               \
>> >>      }                                                      \
>> >
>> > ...fixing it in BDRV_POLL_WHILE() rather than the drain functions may
>> > have been a bad idea.
>> >
>> > I got a bug assigned where we have a large (200 GB) fragmented qcow2
>> > image, and qemu-img convert takes two hours before it even starts to
>> > copy data. What it does in this time is iterating through the whole
>> > image with bdrv_block_status_above() in order to estimate the work to be
>> > done (and of course it will repeat the same calls while actually copying
>> > data).
>>
>> The convert_iteration_sectors loop looks wasteful. Why cannot we
>> report progress simply based on (offset/size) * 100% so we don't need
>> to do this estimation?
>
> The assumption is that it's quick and it makes the progress much more
> meaningful. You know those progress bars that slowly crawl towards 50%
> and then suddenly complete within a second. Or the first 20% are quick,
> but then things get really slow. They are not a great example.
>
> There must have been something wrong with that image file that was
> reported, because they reported that it was the only image causing
> trouble and if they copied it away, it became quick, too.
>
> Even for a maximally fragmented 100 GB qcow2 image it only takes about a
> second on my laptop. So while I don't feel as certain about the loop as
> before, in practice it normally doesn't seem to hurt.

No doubt about normal cases. I was unsure about corner cases like
slow-ish NFS etc.

A little bit of intelligence would be limiting the time for the loop
to a few seconds, for example, "IMG_CVT_EST_SCALE *
bdrv_getlength(bs)", or a less linear map.

>
>> >
>> > One of the things I saw is that between each pair of lseek() calls, we
>> > also have unnecessary poll() calls, and these were introduced by this
>> > patch. If I modify bdrv_common_block_status_above() so that it doesn't
>> > poll if data.done == true already, the time needed to iterate through my
>> > test image is cut in half.
>> >
>> > Now, of course, I'm still only seeing a few seconds for a 100 GB image,
>> > so there must be more that is wrong for the reporter, but it suggests to
>> > me that BDRV_POLL_WHILE() shouldn't be polling unconditionally when only
>> > one of its users actually needs this.
>>
>> Sounds fine to me. Maybe we could add a boolean parameter to
>> BDRV_POLL_WHILE?
>
> Why not call aio_poll() explicitly in the BDRV_POLL_WHILE() condition in
> the one place that needs it?

Yes, that is better. Do you have a patch? Or do you want me to work on one?

Fam


* Re: [Qemu-devel] [PATCH for-2.9-rc5 v4 2/2] block: Drain BH in bdrv_drained_begin
  2018-02-07 12:39         ` Fam Zheng
@ 2018-02-07 13:10           ` Kevin Wolf
  2018-02-07 14:14             ` Fam Zheng
  0 siblings, 1 reply; 13+ messages in thread
From: Kevin Wolf @ 2018-02-07 13:10 UTC (permalink / raw)
  To: Fam Zheng
  Cc: QEMU Developers, qemu-block, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi, Jeffrey Cody

Am 07.02.2018 um 13:39 hat Fam Zheng geschrieben:
> On Wed, Feb 7, 2018 at 6:51 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> > Am 07.02.2018 um 02:48 hat Fam Zheng geschrieben:
> >> On Tue, Feb 6, 2018 at 11:32 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> >> > I got a bug assigned where we have a large (200 GB) fragmented qcow2
> >> > image, and qemu-img convert takes two hours before it even starts to
> >> > copy data. What it does in this time is iterating through the whole
> >> > image with bdrv_block_status_above() in order to estimate the work to be
> >> > done (and of course it will repeat the same calls while actually copying
> >> > data).
> >>
> >> The convert_iteration_sectors loop looks wasteful. Why cannot we
> >> report progress simply based on (offset/size) * 100% so we don't need
> >> to do this estimation?
> >
> > The assumption is that it's quick and it makes the progress much more
> > meaningful. You know those progress bars that slowly crawl towards 50%
> > and then suddenly complete within a second. Or the first 20% are quick,
> > but then things get really slow. They are not a great example.
> >
> > There must have been something wrong with that image file that was
> > reported, because they reported that it was the only image causing
> > trouble and if they copied it away, it became quick, too.
> >
> > Even for a maximally fragmented 100 GB qcow2 image it only takes about a
> > second on my laptop. So while I don't feel as certain about the loop as
> > before, in practice it normally doesn't seem to hurt.
> 
> No doubt about normal cases. I was unsure about corner cases like
> slow-ish NFS etc.

Yeah, NFS seems to be a bit slow. Not two-hours-slow, but when I tried
it with a localhost NFS server, the same operation that took two seconds
directly from the local file system took about 40s over NFS. They seem
to go over the network for each lseek() instead of caching the file map.
Maybe something to fix in the NFS kernel driver.
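
For reference, the raw file driver answers these block status queries with
lseek() probes, roughly like the following simplified sketch (not the exact
block/file-posix.c code):

    /* Each block status query on a raw file probes the file map with
     * SEEK_DATA/SEEK_HOLE; on NFS every probe is a server round trip
     * rather than a cached lookup, which is what makes this slow. */
    off_t data_off = lseek(fd, offset, SEEK_DATA);
    off_t hole_off = lseek(fd, offset, SEEK_HOLE);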

> A little bit of intelligence would be limiting the time for the loop
> to a few seconds, for example, "IMG_CVT_EST_SCALE *
> bdrv_getlength(bs)", or a less linear map.

I don't understand. What would IMG_CVT_EST_SCALE be? Isn't the problem
that this isn't a constant factor but can be anywhere between 0% and
100% depending on the specific image?

> >> > One of the things I saw is that between each pair of lseek() calls, we
> >> > also have unnecessary poll() calls, and these were introduced by this
> >> > patch. If I modify bdrv_common_block_status_above() so that it doesn't
> >> > poll if data.done == true already, the time needed to iterate through my
> >> > test image is cut in half.
> >> >
> >> > Now, of course, I'm still only seeing a few seconds for a 100 GB image,
> >> > so there must be more that is wrong for the reporter, but it suggests to
> >> > me that BDRV_POLL_WHILE() shouldn't be polling unconditionally when only
> >> > one of its users actually needs this.
> >>
> >> Sounds fine to me. Maybe we could add a boolean parameter to
> >> BDRV_POLL_WHILE?
> >
> > Why not call aio_poll() explicitly in the BDRV_POLL_WHILE() condition in
> > the one place that needs it?
> 
> Yes, that is better. Do you have a patch? Or do you want me to work on one?

I don't have a patch yet. If you want to write one, be my guest.

Otherwise I'd just take a to-do note for when I get back to my
bdrv_drain work. There are still one or two series left to do anyway, and
I think it would fit in there.

Kevin


* Re: [Qemu-devel] [PATCH for-2.9-rc5 v4 2/2] block: Drain BH in bdrv_drained_begin
  2018-02-07 13:10           ` Kevin Wolf
@ 2018-02-07 14:14             ` Fam Zheng
  0 siblings, 0 replies; 13+ messages in thread
From: Fam Zheng @ 2018-02-07 14:14 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: QEMU Developers, qemu-block, Max Reitz, Paolo Bonzini,
	Stefan Hajnoczi, Jeffrey Cody

On Wed, Feb 7, 2018 at 9:10 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 07.02.2018 um 13:39 hat Fam Zheng geschrieben:
>> On Wed, Feb 7, 2018 at 6:51 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> > Am 07.02.2018 um 02:48 hat Fam Zheng geschrieben:
>> >> On Tue, Feb 6, 2018 at 11:32 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> >> > I got a bug assigned where we have a large (200 GB) fragmented qcow2
>> >> > image, and qemu-img convert takes two hours before it even starts to
>> >> > copy data. What it does in this time is iterating through the whole
>> >> > image with bdrv_block_status_above() in order to estimate the work to be
>> >> > done (and of course it will repeat the same calls while actually copying
>> >> > data).
>> >>
>> >> The convert_iteration_sectors loop looks wasteful. Why cannot we
>> >> report progress simply based on (offset/size) * 100% so we don't need
>> >> to do this estimation?
>> >
>> > The assumption is that it's quick and it makes the progress much more
>> > meaningful. You know those progress bars that slowly crawl towards 50%
>> > and then suddenly complete within a second. Or the first 20% are quick,
>> > but then things get really slow. They are not a great example.
>> >
>> > There must have been something wrong with that image file that was
>> > reported, because they reported that it was the only image causing
>> > trouble and if they copied it away, it became quick, too.
>> >
>> > Even for a maximally fragmented 100 GB qcow2 image it only takes about a
>> > second on my laptop. So while I don't feel as certain about the loop as
>> > before, in practice it normally doesn't seem to hurt.
>>
>> No doubt about normal cases. I was unsure about corner cases like
>> slow-ish NFS etc.
>
> Yeah, NFS seems to be a bit slow. Not two-hours-slow, but when I tried
> it with a localhost NFS server, the same operation that took two seconds
> directly from the local file system took about 40s over NFS. They seem
> to go over the network for each lseek() instead of caching the file map.
> Maybe something to fix in the NFS kernel driver.
>
>> A little bit of intelligence would be limiting the time for the loop
>> to a few seconds, for example, "IMG_CVT_EST_SCALE *
>> bdrv_getlength(bs)", or a less linear map.
>
> I don't understand. What would IMG_CVT_EST_SCALE be? Isn't the problem
> that this isn't a constant factor but can be anywhere between 0% and
> 100% depending on the specific image?

This is going to be quite arbitrary, just to make sure we don't waste
a very long time on maximally fragmented images, or on slow lseek()
scenarios. Something like this:

#define IMG_CVT_EST_SCALE ((1LL << 40) / 30)    /* ~30 seconds per TiB */

    time_t start = time(NULL);
    while (sector_num < s->total_sectors) {
        if (time(NULL) - start >
            MAX(30, bdrv_getlength(bs) / IMG_CVT_EST_SCALE)) {
            /* Too much time spent on counting allocation, just fall back
             * to bdrv_get_allocated_file_size (bytes -> sectors) */
            s->allocated_sectors =
                bdrv_get_allocated_file_size(bs) >> BDRV_SECTOR_BITS;
            break;
        }
        n = convert_iteration_sectors(s, sector_num);
        ...
    }

So the loop is capped at 30 seconds for images up to 1 TB, with the cap
scaling up proportionally for larger ones.

>
>> >> > One of the things I saw is that between each pair of lseek() calls, we
>> >> > also have unnecessary poll() calls, and these were introduced by this
>> >> > patch. If I modify bdrv_common_block_status_above() so that it doesn't
>> >> > poll if data.done == true already, the time needed to iterate through my
>> >> > test image is cut in half.
>> >> >
>> >> > Now, of course, I'm still only seeing a few seconds for a 100 GB image,
>> >> > so there must be more that is wrong for the reporter, but it suggests to
>> >> > me that BDRV_POLL_WHILE() shouldn't be polling unconditionally when only
>> >> > one of its users actually needs this.
>> >>
>> >> Sounds fine to me. Maybe we could add a boolean parameter to
>> >> BDRV_POLL_WHILE?
>> >
>> > Why not call aio_poll() explicitly in the BDRV_POLL_WHILE() condition in
>> > the one place that needs it?
>>
>> Yes, that is better. Do you have a patch? Or do you want me to work on one?
>
> I don't have a patch yet. If you want to write one, be my guest.
>
> Otherwise I'd just take a to-do note for when I get back to my
> bdrv_drain work. There is still one or two series left to do anyway, and
> I think it would fit in there.

I thought so, which is why I asked. I'll leave it to you then. :)

Fam


Thread overview: 13+ messages
2017-04-18 14:30 [Qemu-devel] [PATCH for-2.9-rc5 v4 0/2] block: Drain BH in bdrv_drained_begin Fam Zheng
2017-04-18 14:30 ` [Qemu-devel] [PATCH for-2.9-rc5 v4 1/2] block: Walk bs->children carefully in bdrv_drain_recurse Fam Zheng
2017-04-18 14:46   ` Kevin Wolf
2017-04-18 14:54     ` Fam Zheng
2017-04-18 14:30 ` [Qemu-devel] [PATCH for-2.9-rc5 v4 2/2] block: Drain BH in bdrv_drained_begin Fam Zheng
2018-02-06 15:32   ` Kevin Wolf
2018-02-07  1:48     ` Fam Zheng
2018-02-07 10:51       ` Kevin Wolf
2018-02-07 12:39         ` Fam Zheng
2018-02-07 13:10           ` Kevin Wolf
2018-02-07 14:14             ` Fam Zheng
2017-04-18 14:42 ` [Qemu-devel] [PATCH for-2.9-rc5 v4 0/2] " Paolo Bonzini
2017-04-18 14:51 ` Jeff Cody
