[PATCH v2 0/2] qcow2: seriously improve savevm performance

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/2] qcow2: seriously improve savevm performance
@ 2020-06-10 19:00 Denis V. Lunev
  2020-06-10 19:00 ` [PATCH 1/2] aio: allow to wait for coroutine pool from different coroutine Denis V. Lunev
  2020-06-10 19:00 ` [PATCH 2/2] qcow2: improve savevm performance Denis V. Lunev
  0 siblings, 2 replies; 11+ messages in thread
From: Denis V. Lunev @ 2020-06-10 19:00 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Denis V . Lunev, Vladimir Sementsov-Ogievskiy,
	Denis Plotnikov, Max Reitz

This series do standard basic things:
- it creates intermediate buffer for all writes from QEMU migration code
  to QCOW2 image,
- this buffer is sent to disk asynchronously, allowing several writes to
  run in parallel.

In general, migration code is fantastically inefficent (by observation),
buffers are not aligned and sent with arbitrary pieces, a lot of time
less than 100 bytes at a chunk, which results in read-modify-write
operations with non-cached operations. It should also be noted that all
operations are performed into unallocated image blocks, which also suffer
due to partial writes to such new clusters.

This patch series is an implementation of idea discussed in the RFC
posted by Denis
https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg01925.html
Results with this series over NVME are better than original code
                original     rfc    this
cached:          1.79s      2.38s   1.27s
non-cached:      3.29s      1.31s   0.81s

Changes from v1:
- patchew warning fixed
- fixed validation that only 1 waiter is allowed in patch 1

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] aio: allow to wait for coroutine pool from different coroutine
  2020-06-10 19:00 [PATCH v2 0/2] qcow2: seriously improve savevm performance Denis V. Lunev
@ 2020-06-10 19:00 ` Denis V. Lunev
  2020-06-11  7:36   ` [PATCH] block/aio_task: allow start/wait task from any coroutine Vladimir Sementsov-Ogievskiy
  2020-06-10 19:00 ` [PATCH 2/2] qcow2: improve savevm performance Denis V. Lunev
  1 sibling, 1 reply; 11+ messages in thread
From: Denis V. Lunev @ 2020-06-10 19:00 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Denis V. Lunev, Vladimir Sementsov-Ogievskiy,
	Denis Plotnikov, Max Reitz

The patch preserves the constraint that the only waiter is allowed.

The patch renames AioTaskPool->main_co to wake_co and removes
AioTaskPool->waiting flag. wake_co keeps coroutine, which is
waiting for wakeup on worker completion. Thus 'waiting' flag
in this semantics is equivalent to 'wake_co != NULL'.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 block/aio_task.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/block/aio_task.c b/block/aio_task.c
index 88989fa248..5183b0729d 100644
--- a/block/aio_task.c
+++ b/block/aio_task.c
@@ -27,11 +27,10 @@
 #include "block/aio_task.h"
 
 struct AioTaskPool {
-    Coroutine *main_co;
+    Coroutine *wake_co;
     int status;
     int max_busy_tasks;
     int busy_tasks;
-    bool waiting;
 };
 
 static void coroutine_fn aio_task_co(void *opaque)
@@ -52,21 +51,21 @@ static void coroutine_fn aio_task_co(void *opaque)
 
     g_free(task);
 
-    if (pool->waiting) {
-        pool->waiting = false;
-        aio_co_wake(pool->main_co);
+    if (pool->wake_co != NULL) {
+        aio_co_wake(pool->wake_co);
+        pool->wake_co = NULL;
     }
 }
 
 void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
 {
     assert(pool->busy_tasks > 0);
-    assert(qemu_coroutine_self() == pool->main_co);
+    assert(pool->wake_co == NULL);
 
-    pool->waiting = true;
+    pool->wake_co = qemu_coroutine_self();
     qemu_coroutine_yield();
 
-    assert(!pool->waiting);
+    assert(pool->wake_co == NULL);
     assert(pool->busy_tasks < pool->max_busy_tasks);
 }
 
@@ -98,7 +97,7 @@ AioTaskPool *coroutine_fn aio_task_pool_new(int max_busy_tasks)
 {
     AioTaskPool *pool = g_new0(AioTaskPool, 1);
 
-    pool->main_co = qemu_coroutine_self();
+    pool->wake_co = NULL;
     pool->max_busy_tasks = max_busy_tasks;
 
     return pool;
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH] block/aio_task: allow start/wait task from any coroutine
  2020-06-10 19:00 ` [PATCH 1/2] aio: allow to wait for coroutine pool from different coroutine Denis V. Lunev
@ 2020-06-11  7:36   ` Vladimir Sementsov-Ogievskiy
  2020-06-11 12:31     ` Denis V. Lunev
  0 siblings, 1 reply; 11+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-11  7:36 UTC (permalink / raw)
  To: qemu-block, qemu-devel; +Cc: kwolf, den, vsementsov, dplotnikov, mreitz

Currently, aio task pool assumes that there is a main coroutine, which
creates tasks and wait for them. Let's remove the restriction by using
CoQueue. Code becomes clearer, interface more obvious.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---

Hi! Here is my counter-propasal for
"[PATCH 1/2] aio: allow to wait for coroutine pool from different coroutine"
by Denis. I'm sure that if we are going to change something here, better
is make the interface work from any coroutine without the restriction of
only-on-waiter at the moment.

(Note, that it is still not thread-safe)


 block/aio_task.c | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/block/aio_task.c b/block/aio_task.c
index 88989fa248..d48b29ff83 100644
--- a/block/aio_task.c
+++ b/block/aio_task.c
@@ -27,11 +27,10 @@
 #include "block/aio_task.h"
 
 struct AioTaskPool {
-    Coroutine *main_co;
     int status;
     int max_busy_tasks;
     int busy_tasks;
-    bool waiting;
+    CoQueue waiters;
 };
 
 static void coroutine_fn aio_task_co(void *opaque)
@@ -52,21 +51,15 @@ static void coroutine_fn aio_task_co(void *opaque)
 
     g_free(task);
 
-    if (pool->waiting) {
-        pool->waiting = false;
-        aio_co_wake(pool->main_co);
-    }
+    qemu_co_queue_next(&pool->waiters);
 }
 
 void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
 {
     assert(pool->busy_tasks > 0);
-    assert(qemu_coroutine_self() == pool->main_co);
 
-    pool->waiting = true;
-    qemu_coroutine_yield();
+    qemu_co_queue_wait(&pool->waiters, NULL);
 
-    assert(!pool->waiting);
     assert(pool->busy_tasks < pool->max_busy_tasks);
 }
 
@@ -98,8 +91,8 @@ AioTaskPool *coroutine_fn aio_task_pool_new(int max_busy_tasks)
 {
     AioTaskPool *pool = g_new0(AioTaskPool, 1);
 
-    pool->main_co = qemu_coroutine_self();
     pool->max_busy_tasks = max_busy_tasks;
+    qemu_co_queue_init(&pool->waiters);
 
     return pool;
 }
-- 
2.21.0



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] block/aio_task: allow start/wait task from any coroutine
  2020-06-11  7:36   ` [PATCH] block/aio_task: allow start/wait task from any coroutine Vladimir Sementsov-Ogievskiy
@ 2020-06-11 12:31     ` Denis V. Lunev
  2020-06-11 12:52       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 11+ messages in thread
From: Denis V. Lunev @ 2020-06-11 12:31 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block, qemu-devel
  Cc: kwolf, dplotnikov, mreitz

On 6/11/20 10:36 AM, Vladimir Sementsov-Ogievskiy wrote:
> Currently, aio task pool assumes that there is a main coroutine, which
> creates tasks and wait for them. Let's remove the restriction by using
> CoQueue. Code becomes clearer, interface more obvious.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>
> Hi! Here is my counter-propasal for
> "[PATCH 1/2] aio: allow to wait for coroutine pool from different coroutine"
> by Denis. I'm sure that if we are going to change something here, better
> is make the interface work from any coroutine without the restriction of
> only-on-waiter at the moment.
>
> (Note, that it is still not thread-safe)
>
>
>  block/aio_task.c | 15 ++++-----------
>  1 file changed, 4 insertions(+), 11 deletions(-)
>
> diff --git a/block/aio_task.c b/block/aio_task.c
> index 88989fa248..d48b29ff83 100644
> --- a/block/aio_task.c
> +++ b/block/aio_task.c
> @@ -27,11 +27,10 @@
>  #include "block/aio_task.h"
>  
>  struct AioTaskPool {
> -    Coroutine *main_co;
>      int status;
>      int max_busy_tasks;
>      int busy_tasks;
> -    bool waiting;
> +    CoQueue waiters;
>  };
>  
>  static void coroutine_fn aio_task_co(void *opaque)
> @@ -52,21 +51,15 @@ static void coroutine_fn aio_task_co(void *opaque)
>  
>      g_free(task);
>  
> -    if (pool->waiting) {
> -        pool->waiting = false;
> -        aio_co_wake(pool->main_co);
> -    }
> +    qemu_co_queue_next(&pool->waiters);
nope, this will wakeup only single waiter.
the code will deadlock If there are 2 waiters for the last
entry.

You need something like qemu_co_queue_restart_all() here
at least.

Den


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] block/aio_task: allow start/wait task from any coroutine
  2020-06-11 12:31     ` Denis V. Lunev
@ 2020-06-11 12:52       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 11+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-11 12:52 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-block, qemu-devel; +Cc: kwolf, dplotnikov, mreitz

11.06.2020 15:31, Denis V. Lunev wrote:
> On 6/11/20 10:36 AM, Vladimir Sementsov-Ogievskiy wrote:
>> Currently, aio task pool assumes that there is a main coroutine, which
>> creates tasks and wait for them. Let's remove the restriction by using
>> CoQueue. Code becomes clearer, interface more obvious.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>
>> Hi! Here is my counter-propasal for
>> "[PATCH 1/2] aio: allow to wait for coroutine pool from different coroutine"
>> by Denis. I'm sure that if we are going to change something here, better
>> is make the interface work from any coroutine without the restriction of
>> only-on-waiter at the moment.
>>
>> (Note, that it is still not thread-safe)
>>
>>
>>   block/aio_task.c | 15 ++++-----------
>>   1 file changed, 4 insertions(+), 11 deletions(-)
>>
>> diff --git a/block/aio_task.c b/block/aio_task.c
>> index 88989fa248..d48b29ff83 100644
>> --- a/block/aio_task.c
>> +++ b/block/aio_task.c
>> @@ -27,11 +27,10 @@
>>   #include "block/aio_task.h"
>>   
>>   struct AioTaskPool {
>> -    Coroutine *main_co;
>>       int status;
>>       int max_busy_tasks;
>>       int busy_tasks;
>> -    bool waiting;
>> +    CoQueue waiters;
>>   };
>>   
>>   static void coroutine_fn aio_task_co(void *opaque)
>> @@ -52,21 +51,15 @@ static void coroutine_fn aio_task_co(void *opaque)
>>   
>>       g_free(task);
>>   
>> -    if (pool->waiting) {
>> -        pool->waiting = false;
>> -        aio_co_wake(pool->main_co);
>> -    }
>> +    qemu_co_queue_next(&pool->waiters);
> nope, this will wakeup only single waiter.
> the code will deadlock If there are 2 waiters for the last
> entry.
> 

Hmm, right.
The problem is that original design combines into one thing two different:
  
   1. wait for all tasks to complete
   2. wait for a free slot, to start a new task

2. should work as is, with qemu_co_queue_next() and co-queue, and for 1. we
should have separate yield-point, to be waken up when busy_tasks becomes 0.

I'll resend.



-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 2/2] qcow2: improve savevm performance
  2020-06-10 19:00 [PATCH v2 0/2] qcow2: seriously improve savevm performance Denis V. Lunev
  2020-06-10 19:00 ` [PATCH 1/2] aio: allow to wait for coroutine pool from different coroutine Denis V. Lunev
@ 2020-06-10 19:00 ` Denis V. Lunev
  2020-06-11  8:04   ` Vladimir Sementsov-Ogievskiy
  1 sibling, 1 reply; 11+ messages in thread
From: Denis V. Lunev @ 2020-06-10 19:00 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Denis V. Lunev, Vladimir Sementsov-Ogievskiy,
	Denis Plotnikov, Max Reitz

This patch does 2 standard basic things:
- it creates intermediate buffer for all writes from QEMU migration code
  to QCOW2 image,
- this buffer is sent to disk asynchronously, allowing several writes to
  run in parallel.

In general, migration code is fantastically inefficent (by observation),
buffers are not aligned and sent with arbitrary pieces, a lot of time
less than 100 bytes at a chunk, which results in read-modify-write
operations with non-cached operations. It should also be noted that all
operations are performed into unallocated image blocks, which also suffer
due to partial writes to such new clusters.

Snapshot creation time (2 GB Fedora-31 VM running over NVME storage):
                original     fixed
cached:          1.79s       1.27s
non-cached:      3.29s       0.81s

The difference over HDD would be more significant :)

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 block/qcow2.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++-
 block/qcow2.h |   4 ++
 2 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 0cd2e6757e..e6232f32e2 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -4797,11 +4797,43 @@ static int qcow2_make_empty(BlockDriverState *bs)
     return ret;
 }
 
+
+typedef struct Qcow2VMStateTask {
+    AioTask task;
+
+    BlockDriverState *bs;
+    int64_t offset;
+    void *buf;
+    size_t bytes;
+} Qcow2VMStateTask;
+
+typedef struct Qcow2SaveVMState {
+    AioTaskPool *pool;
+    Qcow2VMStateTask *t;
+} Qcow2SaveVMState;
+
 static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
 {
     BDRVQcow2State *s = bs->opaque;
+    Qcow2SaveVMState *state = s->savevm_state;
     int ret;
 
+    if (state != NULL) {
+        aio_task_pool_start_task(state->pool, &state->t->task);
+
+        aio_task_pool_wait_all(state->pool);
+        ret = aio_task_pool_status(state->pool);
+
+        aio_task_pool_free(state->pool);
+        g_free(state);
+
+        s->savevm_state = NULL;
+
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
     qemu_co_mutex_lock(&s->lock);
     ret = qcow2_write_caches(bs);
     qemu_co_mutex_unlock(&s->lock);
@@ -5098,14 +5130,89 @@ static int qcow2_has_zero_init(BlockDriverState *bs)
     }
 }
 
+
+static coroutine_fn int qcow2_co_vmstate_task_entry(AioTask *task)
+{
+    int err = 0;
+    Qcow2VMStateTask *t = container_of(task, Qcow2VMStateTask, task);
+
+    if (t->bytes != 0) {
+        QEMUIOVector local_qiov;
+        qemu_iovec_init_buf(&local_qiov, t->buf, t->bytes);
+        err = t->bs->drv->bdrv_co_pwritev_part(t->bs, t->offset, t->bytes,
+                                               &local_qiov, 0, 0);
+    }
+
+    qemu_vfree(t->buf);
+    return err;
+}
+
+static Qcow2VMStateTask *qcow2_vmstate_task_create(BlockDriverState *bs,
+                                                    int64_t pos, size_t size)
+{
+    BDRVQcow2State *s = bs->opaque;
+    Qcow2VMStateTask *t = g_new(Qcow2VMStateTask, 1);
+
+    *t = (Qcow2VMStateTask) {
+        .task.func = qcow2_co_vmstate_task_entry,
+        .buf = qemu_blockalign(bs, size),
+        .offset = qcow2_vm_state_offset(s) + pos,
+        .bs = bs,
+    };
+
+    return t;
+}
+
 static int qcow2_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
                               int64_t pos)
 {
     BDRVQcow2State *s = bs->opaque;
+    Qcow2SaveVMState *state = s->savevm_state;
+    Qcow2VMStateTask *t;
+    size_t buf_size = MAX(s->cluster_size, 1 * MiB);
+    size_t to_copy;
+    size_t off;
 
     BLKDBG_EVENT(bs->file, BLKDBG_VMSTATE_SAVE);
-    return bs->drv->bdrv_co_pwritev_part(bs, qcow2_vm_state_offset(s) + pos,
-                                         qiov->size, qiov, 0, 0);
+
+    if (state == NULL) {
+        state = g_new(Qcow2SaveVMState, 1);
+        *state = (Qcow2SaveVMState) {
+            .pool = aio_task_pool_new(QCOW2_MAX_WORKERS),
+            .t = qcow2_vmstate_task_create(bs, pos, buf_size),
+        };
+
+        s->savevm_state = state;
+    }
+
+    if (aio_task_pool_status(state->pool) != 0) {
+        return aio_task_pool_status(state->pool);
+    }
+
+    t = state->t;
+    if (t->offset + t->bytes != qcow2_vm_state_offset(s) + pos) {
+        /* Normally this branch is not reachable from migration */
+        return bs->drv->bdrv_co_pwritev_part(bs,
+                qcow2_vm_state_offset(s) + pos, qiov->size, qiov, 0, 0);
+    }
+
+    off = 0;
+    while (1) {
+        to_copy = MIN(qiov->size - off, buf_size - t->bytes);
+        qemu_iovec_to_buf(qiov, off, t->buf + t->bytes, to_copy);
+        t->bytes += to_copy;
+        if (t->bytes < buf_size) {
+            return 0;
+        }
+
+        aio_task_pool_start_task(state->pool, &t->task);
+
+        pos += to_copy;
+        off += to_copy;
+        state->t = t = qcow2_vmstate_task_create(bs, pos, buf_size);
+    }
+
+    return 0;
 }
 
 static int qcow2_load_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
diff --git a/block/qcow2.h b/block/qcow2.h
index 7ce2c23bdb..146cfed739 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -291,6 +291,8 @@ typedef struct Qcow2BitmapHeaderExt {
 
 #define QCOW2_MAX_THREADS 4
 
+typedef struct Qcow2SaveVMState Qcow2SaveVMState;
+
 typedef struct BDRVQcow2State {
     int cluster_bits;
     int cluster_size;
@@ -384,6 +386,8 @@ typedef struct BDRVQcow2State {
      * is to convert the image with the desired compression type set.
      */
     Qcow2CompressionType compression_type;
+
+    Qcow2SaveVMState *savevm_state;
 } BDRVQcow2State;
 
 typedef struct Qcow2COWRegion {
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] qcow2: improve savevm performance
  2020-06-10 19:00 ` [PATCH 2/2] qcow2: improve savevm performance Denis V. Lunev
@ 2020-06-11  8:04   ` Vladimir Sementsov-Ogievskiy
  2020-06-11  8:25     ` Denis V. Lunev
  0 siblings, 1 reply; 11+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-11  8:04 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-block, qemu-devel
  Cc: Kevin Wolf, Denis Plotnikov, Max Reitz

10.06.2020 22:00, Denis V. Lunev wrote:
> This patch does 2 standard basic things:
> - it creates intermediate buffer for all writes from QEMU migration code
>    to QCOW2 image,
> - this buffer is sent to disk asynchronously, allowing several writes to
>    run in parallel.
> 
> In general, migration code is fantastically inefficent (by observation),
> buffers are not aligned and sent with arbitrary pieces, a lot of time
> less than 100 bytes at a chunk, which results in read-modify-write
> operations with non-cached operations. It should also be noted that all
> operations are performed into unallocated image blocks, which also suffer
> due to partial writes to such new clusters.
> 
> Snapshot creation time (2 GB Fedora-31 VM running over NVME storage):
>                  original     fixed
> cached:          1.79s       1.27s
> non-cached:      3.29s       0.81s
> 
> The difference over HDD would be more significant :)
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>

If I follow correctly, you make qcow2_save_vmstate implicitly asynchronous:
it may return immediately after creating a task, and task is executing in
parallel.

I think, block-layer is unprepared for such behavior, it rely on the fact that
.bdrv_save_vmstate is synchronous.

For example, look at bdrv_co_rw_vmstate(). It calls drv->bdrv_save_vmstate
inside pair of bdrv_inc_in_flight()/bdrv_dec_in_flight(). It means that with
this patch, we may break drained section.

Next, it's a kind of cache for vmstate-write operation. It seems for me that
it's not directly related to qcow2. So, we can implement it in generic block
layer, where we can handle in_fligth requests. Can we keep .bdrv_save_vmstate
handlers of format drivers as is, keep them synchronous, but instead change
generic interface to be (optionally?) cached?

> ---
>   block/qcow2.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++-
>   block/qcow2.h |   4 ++
>   2 files changed, 113 insertions(+), 2 deletions(-)
> 
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 0cd2e6757e..e6232f32e2 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -4797,11 +4797,43 @@ static int qcow2_make_empty(BlockDriverState *bs)
>       return ret;
>   }
>   
> +
> +typedef struct Qcow2VMStateTask {
> +    AioTask task;
> +
> +    BlockDriverState *bs;
> +    int64_t offset;
> +    void *buf;
> +    size_t bytes;
> +} Qcow2VMStateTask;
> +
> +typedef struct Qcow2SaveVMState {
> +    AioTaskPool *pool;
> +    Qcow2VMStateTask *t;
> +} Qcow2SaveVMState;
> +
>   static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
>   {
>       BDRVQcow2State *s = bs->opaque;
> +    Qcow2SaveVMState *state = s->savevm_state;
>       int ret;
>   
> +    if (state != NULL) {
> +        aio_task_pool_start_task(state->pool, &state->t->task);
> +
> +        aio_task_pool_wait_all(state->pool);
> +        ret = aio_task_pool_status(state->pool);
> +
> +        aio_task_pool_free(state->pool);
> +        g_free(state);
> +
> +        s->savevm_state = NULL;
> +
> +        if (ret < 0) {
> +            return ret;
> +        }
> +    }
> +
>       qemu_co_mutex_lock(&s->lock);
>       ret = qcow2_write_caches(bs);
>       qemu_co_mutex_unlock(&s->lock);
> @@ -5098,14 +5130,89 @@ static int qcow2_has_zero_init(BlockDriverState *bs)
>       }
>   }
>   
> +
> +static coroutine_fn int qcow2_co_vmstate_task_entry(AioTask *task)
> +{
> +    int err = 0;
> +    Qcow2VMStateTask *t = container_of(task, Qcow2VMStateTask, task);
> +
> +    if (t->bytes != 0) {
> +        QEMUIOVector local_qiov;
> +        qemu_iovec_init_buf(&local_qiov, t->buf, t->bytes);
> +        err = t->bs->drv->bdrv_co_pwritev_part(t->bs, t->offset, t->bytes,
> +                                               &local_qiov, 0, 0);
> +    }
> +
> +    qemu_vfree(t->buf);
> +    return err;
> +}
> +
> +static Qcow2VMStateTask *qcow2_vmstate_task_create(BlockDriverState *bs,
> +                                                    int64_t pos, size_t size)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    Qcow2VMStateTask *t = g_new(Qcow2VMStateTask, 1);
> +
> +    *t = (Qcow2VMStateTask) {
> +        .task.func = qcow2_co_vmstate_task_entry,
> +        .buf = qemu_blockalign(bs, size),
> +        .offset = qcow2_vm_state_offset(s) + pos,
> +        .bs = bs,
> +    };
> +
> +    return t;
> +}
> +
>   static int qcow2_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
>                                 int64_t pos)
>   {
>       BDRVQcow2State *s = bs->opaque;
> +    Qcow2SaveVMState *state = s->savevm_state;
> +    Qcow2VMStateTask *t;
> +    size_t buf_size = MAX(s->cluster_size, 1 * MiB);
> +    size_t to_copy;
> +    size_t off;
>   
>       BLKDBG_EVENT(bs->file, BLKDBG_VMSTATE_SAVE);
> -    return bs->drv->bdrv_co_pwritev_part(bs, qcow2_vm_state_offset(s) + pos,
> -                                         qiov->size, qiov, 0, 0);
> +
> +    if (state == NULL) {
> +        state = g_new(Qcow2SaveVMState, 1);
> +        *state = (Qcow2SaveVMState) {
> +            .pool = aio_task_pool_new(QCOW2_MAX_WORKERS),
> +            .t = qcow2_vmstate_task_create(bs, pos, buf_size),
> +        };
> +
> +        s->savevm_state = state;
> +    }
> +
> +    if (aio_task_pool_status(state->pool) != 0) {
> +        return aio_task_pool_status(state->pool);
> +    }
> +
> +    t = state->t;
> +    if (t->offset + t->bytes != qcow2_vm_state_offset(s) + pos) {
> +        /* Normally this branch is not reachable from migration */
> +        return bs->drv->bdrv_co_pwritev_part(bs,
> +                qcow2_vm_state_offset(s) + pos, qiov->size, qiov, 0, 0);
> +    }
> +
> +    off = 0;
> +    while (1) {
> +        to_copy = MIN(qiov->size - off, buf_size - t->bytes);
> +        qemu_iovec_to_buf(qiov, off, t->buf + t->bytes, to_copy);
> +        t->bytes += to_copy;
> +        if (t->bytes < buf_size) {
> +            return 0;
> +        }
> +
> +        aio_task_pool_start_task(state->pool, &t->task);
> +
> +        pos += to_copy;
> +        off += to_copy;
> +        state->t = t = qcow2_vmstate_task_create(bs, pos, buf_size);
> +    }
> +
> +    return 0;
>   }
>   
>   static int qcow2_load_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
> diff --git a/block/qcow2.h b/block/qcow2.h
> index 7ce2c23bdb..146cfed739 100644
> --- a/block/qcow2.h
> +++ b/block/qcow2.h
> @@ -291,6 +291,8 @@ typedef struct Qcow2BitmapHeaderExt {
>   
>   #define QCOW2_MAX_THREADS 4
>   
> +typedef struct Qcow2SaveVMState Qcow2SaveVMState;
> +
>   typedef struct BDRVQcow2State {
>       int cluster_bits;
>       int cluster_size;
> @@ -384,6 +386,8 @@ typedef struct BDRVQcow2State {
>        * is to convert the image with the desired compression type set.
>        */
>       Qcow2CompressionType compression_type;
> +
> +    Qcow2SaveVMState *savevm_state;
>   } BDRVQcow2State;
>   
>   typedef struct Qcow2COWRegion {
> 


-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] qcow2: improve savevm performance
  2020-06-11  8:04   ` Vladimir Sementsov-Ogievskiy
@ 2020-06-11  8:25     ` Denis V. Lunev
  2020-06-11  8:44       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 1 reply; 11+ messages in thread
From: Denis V. Lunev @ 2020-06-11  8:25 UTC (permalink / raw)
  To: Vladimir Sementsov-Ogievskiy, qemu-block, qemu-devel
  Cc: Kevin Wolf, Denis Plotnikov, Max Reitz

On 6/11/20 11:04 AM, Vladimir Sementsov-Ogievskiy wrote:
> 10.06.2020 22:00, Denis V. Lunev wrote:
>> This patch does 2 standard basic things:
>> - it creates intermediate buffer for all writes from QEMU migration code
>>    to QCOW2 image,
>> - this buffer is sent to disk asynchronously, allowing several writes to
>>    run in parallel.
>>
>> In general, migration code is fantastically inefficent (by observation),
>> buffers are not aligned and sent with arbitrary pieces, a lot of time
>> less than 100 bytes at a chunk, which results in read-modify-write
>> operations with non-cached operations. It should also be noted that all
>> operations are performed into unallocated image blocks, which also
>> suffer
>> due to partial writes to such new clusters.
>>
>> Snapshot creation time (2 GB Fedora-31 VM running over NVME storage):
>>                  original     fixed
>> cached:          1.79s       1.27s
>> non-cached:      3.29s       0.81s
>>
>> The difference over HDD would be more significant :)
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>
> If I follow correctly, you make qcow2_save_vmstate implicitly
> asynchronous:
> it may return immediately after creating a task, and task is executing in
> parallel.
>
> I think, block-layer is unprepared for such behavior, it rely on the
> fact that
> .bdrv_save_vmstate is synchronous.
>
> For example, look at bdrv_co_rw_vmstate(). It calls
> drv->bdrv_save_vmstate
> inside pair of bdrv_inc_in_flight()/bdrv_dec_in_flight(). It means
> that with
> this patch, we may break drained section.
>
Drained sections are not involved into the equation - the guest
is stopped at the moment.

If we are talking about in_flight count, it should not be a problem
even if the guest is running. We could just put inc/dec into
qcow2_co_vmstate_task_entry().

> Next, it's a kind of cache for vmstate-write operation. It seems for
> me that
> it's not directly related to qcow2. So, we can implement it in generic
> block
> layer, where we can handle in_fligth requests. Can we keep
> .bdrv_save_vmstate
> handlers of format drivers as is, keep them synchronous, but instead
> change
> generic interface to be (optionally?) cached?

This has been discussed already in the previous thread. .bdrv_save_vmstate
is implemented in QCOW2 and sheepdog only. Thus other formats are
mostly non-interested.

>
>> ---
>>   block/qcow2.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++-
>>   block/qcow2.h |   4 ++
>>   2 files changed, 113 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 0cd2e6757e..e6232f32e2 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -4797,11 +4797,43 @@ static int qcow2_make_empty(BlockDriverState
>> *bs)
>>       return ret;
>>   }
>>   +
>> +typedef struct Qcow2VMStateTask {
>> +    AioTask task;
>> +
>> +    BlockDriverState *bs;
>> +    int64_t offset;
>> +    void *buf;
>> +    size_t bytes;
>> +} Qcow2VMStateTask;
>> +
>> +typedef struct Qcow2SaveVMState {
>> +    AioTaskPool *pool;
>> +    Qcow2VMStateTask *t;
>> +} Qcow2SaveVMState;
>> +
>>   static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
>>   {
>>       BDRVQcow2State *s = bs->opaque;
>> +    Qcow2SaveVMState *state = s->savevm_state;
>>       int ret;
>>   +    if (state != NULL) {
>> +        aio_task_pool_start_task(state->pool, &state->t->task);
>> +
>> +        aio_task_pool_wait_all(state->pool);
>> +        ret = aio_task_pool_status(state->pool);
>> +
>> +        aio_task_pool_free(state->pool);
>> +        g_free(state);
>> +
>> +        s->savevm_state = NULL;
>> +
>> +        if (ret < 0) {
>> +            return ret;
>> +        }
>> +    }
>> +
>>       qemu_co_mutex_lock(&s->lock);
>>       ret = qcow2_write_caches(bs);
>>       qemu_co_mutex_unlock(&s->lock);
>> @@ -5098,14 +5130,89 @@ static int
>> qcow2_has_zero_init(BlockDriverState *bs)
>>       }
>>   }
>>   +
>> +static coroutine_fn int qcow2_co_vmstate_task_entry(AioTask *task)
>> +{
>> +    int err = 0;
>> +    Qcow2VMStateTask *t = container_of(task, Qcow2VMStateTask, task);
>> +
>> +    if (t->bytes != 0) {
>> +        QEMUIOVector local_qiov;
>> +        qemu_iovec_init_buf(&local_qiov, t->buf, t->bytes);
>> +        err = t->bs->drv->bdrv_co_pwritev_part(t->bs, t->offset,
>> t->bytes,
>> +                                               &local_qiov, 0, 0);
>> +    }
>> +
>> +    qemu_vfree(t->buf);
>> +    return err;
>> +}
>> +
>> +static Qcow2VMStateTask *qcow2_vmstate_task_create(BlockDriverState
>> *bs,
>> +                                                    int64_t pos,
>> size_t size)
>> +{
>> +    BDRVQcow2State *s = bs->opaque;
>> +    Qcow2VMStateTask *t = g_new(Qcow2VMStateTask, 1);
>> +
>> +    *t = (Qcow2VMStateTask) {
>> +        .task.func = qcow2_co_vmstate_task_entry,
>> +        .buf = qemu_blockalign(bs, size),
>> +        .offset = qcow2_vm_state_offset(s) + pos,
>> +        .bs = bs,
>> +    };
>> +
>> +    return t;
>> +}
>> +
>>   static int qcow2_save_vmstate(BlockDriverState *bs, QEMUIOVector
>> *qiov,
>>                                 int64_t pos)
>>   {
>>       BDRVQcow2State *s = bs->opaque;
>> +    Qcow2SaveVMState *state = s->savevm_state;
>> +    Qcow2VMStateTask *t;
>> +    size_t buf_size = MAX(s->cluster_size, 1 * MiB);
>> +    size_t to_copy;
>> +    size_t off;
>>         BLKDBG_EVENT(bs->file, BLKDBG_VMSTATE_SAVE);
>> -    return bs->drv->bdrv_co_pwritev_part(bs,
>> qcow2_vm_state_offset(s) + pos,
>> -                                         qiov->size, qiov, 0, 0);
>> +
>> +    if (state == NULL) {
>> +        state = g_new(Qcow2SaveVMState, 1);
>> +        *state = (Qcow2SaveVMState) {
>> +            .pool = aio_task_pool_new(QCOW2_MAX_WORKERS),
>> +            .t = qcow2_vmstate_task_create(bs, pos, buf_size),
>> +        };
>> +
>> +        s->savevm_state = state;
>> +    }
>> +
>> +    if (aio_task_pool_status(state->pool) != 0) {
>> +        return aio_task_pool_status(state->pool);
>> +    }
>> +
>> +    t = state->t;
>> +    if (t->offset + t->bytes != qcow2_vm_state_offset(s) + pos) {
>> +        /* Normally this branch is not reachable from migration */
>> +        return bs->drv->bdrv_co_pwritev_part(bs,
>> +                qcow2_vm_state_offset(s) + pos, qiov->size, qiov, 0,
>> 0);
>> +    }
>> +
>> +    off = 0;
>> +    while (1) {
>> +        to_copy = MIN(qiov->size - off, buf_size - t->bytes);
>> +        qemu_iovec_to_buf(qiov, off, t->buf + t->bytes, to_copy);
>> +        t->bytes += to_copy;
>> +        if (t->bytes < buf_size) {
>> +            return 0;
>> +        }
>> +
>> +        aio_task_pool_start_task(state->pool, &t->task);
>> +
>> +        pos += to_copy;
>> +        off += to_copy;
>> +        state->t = t = qcow2_vmstate_task_create(bs, pos, buf_size);
>> +    }
>> +
>> +    return 0;
>>   }
>>     static int qcow2_load_vmstate(BlockDriverState *bs, QEMUIOVector
>> *qiov,
>> diff --git a/block/qcow2.h b/block/qcow2.h
>> index 7ce2c23bdb..146cfed739 100644
>> --- a/block/qcow2.h
>> +++ b/block/qcow2.h
>> @@ -291,6 +291,8 @@ typedef struct Qcow2BitmapHeaderExt {
>>     #define QCOW2_MAX_THREADS 4
>>   +typedef struct Qcow2SaveVMState Qcow2SaveVMState;
>> +
>>   typedef struct BDRVQcow2State {
>>       int cluster_bits;
>>       int cluster_size;
>> @@ -384,6 +386,8 @@ typedef struct BDRVQcow2State {
>>        * is to convert the image with the desired compression type set.
>>        */
>>       Qcow2CompressionType compression_type;
>> +
>> +    Qcow2SaveVMState *savevm_state;
>>   } BDRVQcow2State;
>>     typedef struct Qcow2COWRegion {
>>
>
>



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] qcow2: improve savevm performance
  2020-06-11  8:25     ` Denis V. Lunev
@ 2020-06-11  8:44       ` Vladimir Sementsov-Ogievskiy
  0 siblings, 0 replies; 11+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2020-06-11  8:44 UTC (permalink / raw)
  To: Denis V. Lunev, qemu-block, qemu-devel
  Cc: Kevin Wolf, Denis Plotnikov, Max Reitz

11.06.2020 11:25, Denis V. Lunev wrote:
> On 6/11/20 11:04 AM, Vladimir Sementsov-Ogievskiy wrote:
>> 10.06.2020 22:00, Denis V. Lunev wrote:
>>> This patch does 2 standard basic things:
>>> - it creates intermediate buffer for all writes from QEMU migration code
>>>     to QCOW2 image,
>>> - this buffer is sent to disk asynchronously, allowing several writes to
>>>     run in parallel.
>>>
>>> In general, migration code is fantastically inefficent (by observation),
>>> buffers are not aligned and sent with arbitrary pieces, a lot of time
>>> less than 100 bytes at a chunk, which results in read-modify-write
>>> operations with non-cached operations. It should also be noted that all
>>> operations are performed into unallocated image blocks, which also
>>> suffer
>>> due to partial writes to such new clusters.
>>>
>>> Snapshot creation time (2 GB Fedora-31 VM running over NVME storage):
>>>                   original     fixed
>>> cached:          1.79s       1.27s
>>> non-cached:      3.29s       0.81s
>>>
>>> The difference over HDD would be more significant :)
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>
>> If I follow correctly, you make qcow2_save_vmstate implicitly
>> asynchronous:
>> it may return immediately after creating a task, and task is executing in
>> parallel.
>>
>> I think, block-layer is unprepared for such behavior, it rely on the
>> fact that
>> .bdrv_save_vmstate is synchronous.
>>
>> For example, look at bdrv_co_rw_vmstate(). It calls
>> drv->bdrv_save_vmstate
>> inside pair of bdrv_inc_in_flight()/bdrv_dec_in_flight(). It means
>> that with
>> this patch, we may break drained section.
>>
> Drained sections are not involved into the equation - the guest
> is stopped at the moment.
> 
> If we are talking about in_flight count, it should not be a problem
> even if the guest is running. We could just put inc/dec into
> qcow2_co_vmstate_task_entry().

No, we at least should do "inc" in qcow2_save_vmstate, when we are inside
original generic inc/dec. Not a big deal still.

> 
>> Next, it's a kind of cache for vmstate-write operation. It seems for
>> me that
>> it's not directly related to qcow2. So, we can implement it in generic
>> block
>> layer, where we can handle in_fligth requests. Can we keep
>> .bdrv_save_vmstate
>> handlers of format drivers as is, keep them synchronous, but instead
>> change
>> generic interface to be (optionally?) cached?
> 
> This has been discussed already in the previous thread. .bdrv_save_vmstate
> is implemented in QCOW2 and sheepdog only. Thus other formats are
> mostly non-interested.

Hmm, yes, sorry, I forget. It seems that sheepdoc project is dead. Who knows,
is it correct?

OK, it's probably OK to make the interface optianally-async, if we add in-flight
request correctly but at least we should document it around .bdrv_save_vmstate
handler declaration.



-- 
Best regards,
Vladimir


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] aio: allow to wait for coroutine pool from different coroutine
@ 2020-06-10 18:58 Denis V. Lunev
  2020-06-10 18:58 ` [PATCH 2/2] qcow2: improve savevm performance Denis V. Lunev
  0 siblings, 1 reply; 11+ messages in thread
From: Denis V. Lunev @ 2020-06-10 18:58 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Denis V. Lunev, Vladimir Sementsov-Ogievskiy,
	Denis Plotnikov, Max Reitz

The patch preserves the constraint that the only waiter is allowed.

The patch renames AioTaskPool->main_co to wake_co and removes
AioTaskPool->waiting flag. wake_co keeps coroutine, which is
waiting for wakeup on worker completion. Thus 'waiting' flag
in this semantics is equivalent to 'wake_co != NULL'.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 block/aio_task.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/block/aio_task.c b/block/aio_task.c
index 88989fa248..5183b0729d 100644
--- a/block/aio_task.c
+++ b/block/aio_task.c
@@ -27,11 +27,10 @@
 #include "block/aio_task.h"
 
 struct AioTaskPool {
-    Coroutine *main_co;
+    Coroutine *wake_co;
     int status;
     int max_busy_tasks;
     int busy_tasks;
-    bool waiting;
 };
 
 static void coroutine_fn aio_task_co(void *opaque)
@@ -52,21 +51,21 @@ static void coroutine_fn aio_task_co(void *opaque)
 
     g_free(task);
 
-    if (pool->waiting) {
-        pool->waiting = false;
-        aio_co_wake(pool->main_co);
+    if (pool->wake_co != NULL) {
+        aio_co_wake(pool->wake_co);
+        pool->wake_co = NULL;
     }
 }
 
 void coroutine_fn aio_task_pool_wait_one(AioTaskPool *pool)
 {
     assert(pool->busy_tasks > 0);
-    assert(qemu_coroutine_self() == pool->main_co);
+    assert(pool->wake_co == NULL);
 
-    pool->waiting = true;
+    pool->wake_co = qemu_coroutine_self();
     qemu_coroutine_yield();
 
-    assert(!pool->waiting);
+    assert(pool->wake_co == NULL);
     assert(pool->busy_tasks < pool->max_busy_tasks);
 }
 
@@ -98,7 +97,7 @@ AioTaskPool *coroutine_fn aio_task_pool_new(int max_busy_tasks)
 {
     AioTaskPool *pool = g_new0(AioTaskPool, 1);
 
-    pool->main_co = qemu_coroutine_self();
+    pool->wake_co = NULL;
     pool->max_busy_tasks = max_busy_tasks;
 
     return pool;
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/2] qcow2: improve savevm performance
  2020-06-10 18:58 [PATCH 1/2] aio: allow to wait for coroutine pool from different coroutine Denis V. Lunev
@ 2020-06-10 18:58 ` Denis V. Lunev
  0 siblings, 0 replies; 11+ messages in thread
From: Denis V. Lunev @ 2020-06-10 18:58 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Denis V. Lunev, Vladimir Sementsov-Ogievskiy,
	Denis Plotnikov, Max Reitz

This patch does 2 standard basic things:
- it creates intermediate buffer for all writes from QEMU migration code
  to QCOW2 image,
- this buffer is sent to disk asynchronously, allowing several writes to
  run in parallel.

In general, migration code is fantastically inefficent (by observation),
buffers are not aligned and sent with arbitrary pieces, a lot of time
less than 100 bytes at a chunk, which results in read-modify-write
operations with non-cached operations. It should also be noted that all
operations are performed into unallocated image blocks, which also suffer
due to partial writes to such new clusters.

Snapshot creation time (2 GB Fedora-31 VM running over NVME storage):
                original     fixed
cached:          1.79s       1.27s
non-cached:      3.29s       0.81s

The difference over HDD would be more significant :)

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 block/qcow2.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++-
 block/qcow2.h |   4 ++
 2 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 0cd2e6757e..e6232f32e2 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -4797,11 +4797,43 @@ static int qcow2_make_empty(BlockDriverState *bs)
     return ret;
 }
 
+
+typedef struct Qcow2VMStateTask {
+    AioTask task;
+
+    BlockDriverState *bs;
+    int64_t offset;
+    void *buf;
+    size_t bytes;
+} Qcow2VMStateTask;
+
+typedef struct Qcow2SaveVMState {
+    AioTaskPool *pool;
+    Qcow2VMStateTask *t;
+} Qcow2SaveVMState;
+
 static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
 {
     BDRVQcow2State *s = bs->opaque;
+    Qcow2SaveVMState *state = s->savevm_state;
     int ret;
 
+    if (state != NULL) {
+        aio_task_pool_start_task(state->pool, &state->t->task);
+
+        aio_task_pool_wait_all(state->pool);
+        ret = aio_task_pool_status(state->pool);
+
+        aio_task_pool_free(state->pool);
+        g_free(state);
+
+        s->savevm_state = NULL;
+
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
     qemu_co_mutex_lock(&s->lock);
     ret = qcow2_write_caches(bs);
     qemu_co_mutex_unlock(&s->lock);
@@ -5098,14 +5130,89 @@ static int qcow2_has_zero_init(BlockDriverState *bs)
     }
 }
 
+
+static coroutine_fn int qcow2_co_vmstate_task_entry(AioTask *task)
+{
+    int err = 0;
+    Qcow2VMStateTask *t = container_of(task, Qcow2VMStateTask, task);
+
+    if (t->bytes != 0) {
+        QEMUIOVector local_qiov;
+        qemu_iovec_init_buf(&local_qiov, t->buf, t->bytes);
+        err = t->bs->drv->bdrv_co_pwritev_part(t->bs, t->offset, t->bytes,
+                                               &local_qiov, 0, 0);
+    }
+
+    qemu_vfree(t->buf);
+    return err;
+}
+
+static Qcow2VMStateTask *qcow2_vmstate_task_create(BlockDriverState *bs,
+                                                    int64_t pos, size_t size)
+{
+    BDRVQcow2State *s = bs->opaque;
+    Qcow2VMStateTask *t = g_new(Qcow2VMStateTask, 1);
+
+    *t = (Qcow2VMStateTask) {
+        .task.func = qcow2_co_vmstate_task_entry,
+        .buf = qemu_blockalign(bs, size),
+        .offset = qcow2_vm_state_offset(s) + pos,
+        .bs = bs,
+    };
+
+    return t;
+}
+
 static int qcow2_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
                               int64_t pos)
 {
     BDRVQcow2State *s = bs->opaque;
+    Qcow2SaveVMState *state = s->savevm_state;
+    Qcow2VMStateTask *t;
+    size_t buf_size = MAX(s->cluster_size, 1 * MiB);
+    size_t to_copy;
+    size_t off;
 
     BLKDBG_EVENT(bs->file, BLKDBG_VMSTATE_SAVE);
-    return bs->drv->bdrv_co_pwritev_part(bs, qcow2_vm_state_offset(s) + pos,
-                                         qiov->size, qiov, 0, 0);
+
+    if (state == NULL) {
+        state = g_new(Qcow2SaveVMState, 1);
+        *state = (Qcow2SaveVMState) {
+            .pool = aio_task_pool_new(QCOW2_MAX_WORKERS),
+            .t = qcow2_vmstate_task_create(bs, pos, buf_size),
+        };
+
+        s->savevm_state = state;
+    }
+
+    if (aio_task_pool_status(state->pool) != 0) {
+        return aio_task_pool_status(state->pool);
+    }
+
+    t = state->t;
+    if (t->offset + t->bytes != qcow2_vm_state_offset(s) + pos) {
+        /* Normally this branch is not reachable from migration */
+        return bs->drv->bdrv_co_pwritev_part(bs,
+                qcow2_vm_state_offset(s) + pos, qiov->size, qiov, 0, 0);
+    }
+
+    off = 0;
+    while (1) {
+        to_copy = MIN(qiov->size - off, buf_size - t->bytes);
+        qemu_iovec_to_buf(qiov, off, t->buf + t->bytes, to_copy);
+        t->bytes += to_copy;
+        if (t->bytes < buf_size) {
+            return 0;
+        }
+
+        aio_task_pool_start_task(state->pool, &t->task);
+
+        pos += to_copy;
+        off += to_copy;
+        state->t = t = qcow2_vmstate_task_create(bs, pos, buf_size);
+    }
+
+    return 0;
 }
 
 static int qcow2_load_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
diff --git a/block/qcow2.h b/block/qcow2.h
index 7ce2c23bdb..146cfed739 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -291,6 +291,8 @@ typedef struct Qcow2BitmapHeaderExt {
 
 #define QCOW2_MAX_THREADS 4
 
+typedef struct Qcow2SaveVMState Qcow2SaveVMState;
+
 typedef struct BDRVQcow2State {
     int cluster_bits;
     int cluster_size;
@@ -384,6 +386,8 @@ typedef struct BDRVQcow2State {
      * is to convert the image with the desired compression type set.
      */
     Qcow2CompressionType compression_type;
+
+    Qcow2SaveVMState *savevm_state;
 } BDRVQcow2State;
 
 typedef struct Qcow2COWRegion {
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 0/2] qcow2: seriously improve savevm performance
@ 2020-06-10 14:41 Denis V. Lunev
  2020-06-10 14:41 ` [PATCH 2/2] qcow2: " Denis V. Lunev
  0 siblings, 1 reply; 11+ messages in thread
From: Denis V. Lunev @ 2020-06-10 14:41 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Denis V . Lunev, Vladimir Sementsov-Ogievskiy,
	Denis Plotnikov, Max Reitz

This series do standard basic things:
- it creates intermediate buffer for all writes from QEMU migration code
  to QCOW2 image,
- this buffer is sent to disk asynchronously, allowing several writes to
  run in parallel.

In general, migration code is fantastically inefficent (by observation),
buffers are not aligned and sent with arbitrary pieces, a lot of time
less than 100 bytes at a chunk, which results in read-modify-write
operations with non-cached operations. It should also be noted that all
operations are performed into unallocated image blocks, which also suffer
due to partial writes to such new clusters.

This patch series is an implementation of idea discussed in the RFC
posted by Denis
https://lists.gnu.org/archive/html/qemu-devel/2020-04/msg01925.html
Results with this series over NVME are better than original code
                original     rfc    this
cached:          1.79s      2.38s   1.27s
non-cached:      3.29s      1.31s   0.81s

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 2/2] qcow2: improve savevm performance
  2020-06-10 14:41 [PATCH 0/2] qcow2: seriously " Denis V. Lunev
@ 2020-06-10 14:41 ` Denis V. Lunev
  0 siblings, 0 replies; 11+ messages in thread
From: Denis V. Lunev @ 2020-06-10 14:41 UTC (permalink / raw)
  To: qemu-block, qemu-devel
  Cc: Kevin Wolf, Denis V. Lunev, Vladimir Sementsov-Ogievskiy,
	Denis Plotnikov, Max Reitz

This patch does 2 standard basic things:
- it creates intermediate buffer for all writes from QEMU migration code
  to QCOW2 image,
- this buffer is sent to disk asynchronously, allowing several writes to
  run in parallel.

In general, migration code is fantastically inefficent (by observation),
buffers are not aligned and sent with arbitrary pieces, a lot of time
less than 100 bytes at a chunk, which results in read-modify-write
operations with non-cached operations. It should also be noted that all
operations are performed into unallocated image blocks, which also suffer
due to partial writes to such new clusters.

Snapshot creation time (2 GB Fedora-31 VM running over NVME storage):
                original     fixed
cached:          1.79s       1.27s
non-cached:      3.29s       0.81s

The difference over HDD would be more significant :)

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
CC: Denis Plotnikov <dplotnikov@virtuozzo.com>
---
 block/qcow2.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++-
 block/qcow2.h |   4 ++
 2 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 0cd2e6757e..e2ae69422a 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -4797,11 +4797,43 @@ static int qcow2_make_empty(BlockDriverState *bs)
     return ret;
 }
 
+
+typedef struct Qcow2VMStateTask {
+    AioTask task;
+
+    BlockDriverState *bs;
+    int64_t offset;
+    void *buf;
+    size_t bytes;
+} Qcow2VMStateTask;
+
+typedef struct Qcow2SaveVMState {
+    AioTaskPool *pool;
+    Qcow2VMStateTask *t;
+} Qcow2SaveVMState;
+
 static coroutine_fn int qcow2_co_flush_to_os(BlockDriverState *bs)
 {
     BDRVQcow2State *s = bs->opaque;
+    Qcow2SaveVMState *state = s->savevm_state;
     int ret;
 
+    if (state != NULL) {
+        aio_task_pool_start_task(state->pool, &state->t->task);
+
+        aio_task_pool_wait_all(state->pool);
+        ret = aio_task_pool_status(state->pool);
+
+        aio_task_pool_free(state->pool);
+        g_free(state);
+
+        s->savevm_state = NULL;
+
+        if (ret < 0) {
+            return ret;
+        }
+    }
+
     qemu_co_mutex_lock(&s->lock);
     ret = qcow2_write_caches(bs);
     qemu_co_mutex_unlock(&s->lock);
@@ -5098,14 +5130,89 @@ static int qcow2_has_zero_init(BlockDriverState *bs)
     }
 }
 
+
+static coroutine_fn int qcow2_co_vmstate_task_entry(AioTask *task)
+{
+    int err;
+    Qcow2VMStateTask *t = container_of(task, Qcow2VMStateTask, task);
+
+    if (t->bytes != 0) {
+        QEMUIOVector local_qiov;
+        qemu_iovec_init_buf(&local_qiov, t->buf, t->bytes);
+        err = t->bs->drv->bdrv_co_pwritev_part(t->bs, t->offset, t->bytes,
+                                               &local_qiov, 0, 0);
+    }
+
+    qemu_vfree(t->buf);
+    return err;
+}
+
+static Qcow2VMStateTask *qcow2_vmstate_task_create(BlockDriverState *bs,
+                                                    int64_t pos, size_t size)
+{
+    BDRVQcow2State *s = bs->opaque;
+    Qcow2VMStateTask *t = g_new(Qcow2VMStateTask, 1);
+
+    *t = (Qcow2VMStateTask) {
+        .task.func = qcow2_co_vmstate_task_entry,
+        .buf = qemu_blockalign(bs, size),
+        .offset = qcow2_vm_state_offset(s) + pos,
+        .bs = bs,
+    };
+
+    return t;
+}
+
 static int qcow2_save_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
                               int64_t pos)
 {
     BDRVQcow2State *s = bs->opaque;
+    Qcow2SaveVMState *state = s->savevm_state;
+    Qcow2VMStateTask *t;
+    size_t buf_size = MAX(s->cluster_size, 1 * MiB);
+    size_t to_copy;
+    size_t off;
 
     BLKDBG_EVENT(bs->file, BLKDBG_VMSTATE_SAVE);
-    return bs->drv->bdrv_co_pwritev_part(bs, qcow2_vm_state_offset(s) + pos,
-                                         qiov->size, qiov, 0, 0);
+
+    if (state == NULL) {
+        state = g_new(Qcow2SaveVMState, 1);
+        *state = (Qcow2SaveVMState) {
+            .pool = aio_task_pool_new(QCOW2_MAX_WORKERS),
+            .t = qcow2_vmstate_task_create(bs, pos, buf_size),
+        };
+
+        s->savevm_state = state;
+    }
+
+    if (aio_task_pool_status(state->pool) != 0) {
+        return aio_task_pool_status(state->pool);
+    }
+
+    t = state->t;
+    if (t->offset + t->bytes != qcow2_vm_state_offset(s) + pos) {
+        /* Normally this branch is not reachable from migration */
+        return bs->drv->bdrv_co_pwritev_part(bs,
+                qcow2_vm_state_offset(s) + pos, qiov->size, qiov, 0, 0);
+    }
+
+    off = 0;
+    while (1) {
+        to_copy = MIN(qiov->size - off, buf_size - t->bytes);
+        qemu_iovec_to_buf(qiov, off, t->buf + t->bytes, to_copy);
+        t->bytes += to_copy;
+        if (t->bytes < buf_size) {
+            return 0;
+        }
+
+        aio_task_pool_start_task(state->pool, &t->task);
+
+        pos += to_copy;
+        off += to_copy;
+        state->t = t = qcow2_vmstate_task_create(bs, pos, buf_size);
+    }
+
+    return 0;
 }
 
 static int qcow2_load_vmstate(BlockDriverState *bs, QEMUIOVector *qiov,
diff --git a/block/qcow2.h b/block/qcow2.h
index 7ce2c23bdb..146cfed739 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -291,6 +291,8 @@ typedef struct Qcow2BitmapHeaderExt {
 
 #define QCOW2_MAX_THREADS 4
 
+typedef struct Qcow2SaveVMState Qcow2SaveVMState;
+
 typedef struct BDRVQcow2State {
     int cluster_bits;
     int cluster_size;
@@ -384,6 +386,8 @@ typedef struct BDRVQcow2State {
      * is to convert the image with the desired compression type set.
      */
     Qcow2CompressionType compression_type;
+
+    Qcow2SaveVMState *savevm_state;
 } BDRVQcow2State;
 
 typedef struct Qcow2COWRegion {
-- 
2.17.1



^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-06-11 13:01 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-06-10 19:00 [PATCH v2 0/2] qcow2: seriously improve savevm performance Denis V. Lunev
2020-06-10 19:00 ` [PATCH 1/2] aio: allow to wait for coroutine pool from different coroutine Denis V. Lunev
2020-06-11  7:36   ` [PATCH] block/aio_task: allow start/wait task from any coroutine Vladimir Sementsov-Ogievskiy
2020-06-11 12:31     ` Denis V. Lunev
2020-06-11 12:52       ` Vladimir Sementsov-Ogievskiy
2020-06-10 19:00 ` [PATCH 2/2] qcow2: improve savevm performance Denis V. Lunev
2020-06-11  8:04   ` Vladimir Sementsov-Ogievskiy
2020-06-11  8:25     ` Denis V. Lunev
2020-06-11  8:44       ` Vladimir Sementsov-Ogievskiy
  -- strict thread matches above, loose matches on Subject: below --
2020-06-10 18:58 [PATCH 1/2] aio: allow to wait for coroutine pool from different coroutine Denis V. Lunev
2020-06-10 18:58 ` [PATCH 2/2] qcow2: improve savevm performance Denis V. Lunev
2020-06-10 14:41 [PATCH 0/2] qcow2: seriously " Denis V. Lunev
2020-06-10 14:41 ` [PATCH 2/2] qcow2: " Denis V. Lunev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).