* [Qemu-devel] [PATCH v2 0/2] block: Handle multiple write requests at once
@ 2009-09-08 12:49 Kevin Wolf
2009-09-08 12:49 ` [Qemu-devel] [PATCH v2 1/2] Add bdrv_aio_multiwrite Kevin Wolf
2009-09-08 12:49 ` [Qemu-devel] [PATCH v2 2/2] virtio-blk: Use bdrv_aio_multiwrite Kevin Wolf
0 siblings, 2 replies; 4+ messages in thread
From: Kevin Wolf @ 2009-09-08 12:49 UTC (permalink / raw)
To: qemu-devel; +Cc: Kevin Wolf
virtio often issues multiple requests in a row, but submits each one to the
block layer independently. If the block drivers knew about all of these
requests at once, they could optimize the way they handle them. See the
description of patch 1 for how qcow2 can use this to avoid unnecessary writes
to the disk.
v2:
The simple request merging code isn't qcow2 specific at all, so enable it for
all formats. A cleverer merging policy can still be implemented later in a
driver-specific way.
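For illustration, a caller of the new interface fills in an array of
BlockRequest structures and submits them in one go; the qiovs and the
completion callback below are placeholders, not code from this series:

    BlockRequest reqs[2];

    reqs[0].sector     = 0;
    reqs[0].nb_sectors = 8;
    reqs[0].qiov       = qiov_a;        /* placeholder QEMUIOVector */
    reqs[0].cb         = write_done;    /* placeholder completion callback */
    reqs[0].opaque     = req_a;
    reqs[0].error      = 0;

    reqs[1].sector     = 1024;
    reqs[1].nb_sectors = 16;
    reqs[1].qiov       = qiov_b;
    reqs[1].cb         = write_done;
    reqs[1].opaque     = req_b;
    reqs[1].error      = 0;

    if (bdrv_aio_multiwrite(bs, reqs, 2) < 0) {
        /* Only requests with error != 0 were never submitted and must be
         * failed by the caller; all others still complete via their cb. */
    }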
Kevin Wolf (2):
Add bdrv_aio_multiwrite
virtio-blk: Use bdrv_aio_multiwrite
block.c | 147 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
block.h | 15 ++++++
block_int.h | 3 +
cutils.c | 17 ++++++
hw/virtio-blk.c | 50 ++++++++++++++++---
qemu-common.h | 1 +
6 files changed, 225 insertions(+), 8 deletions(-)
* [Qemu-devel] [PATCH v2 1/2] Add bdrv_aio_multiwrite
2009-09-08 12:49 [Qemu-devel] [PATCH v2 0/2] block: Handle multiple write requests at once Kevin Wolf
@ 2009-09-08 12:49 ` Kevin Wolf
[not found] ` <m3ws49shch.fsf@neno.mitica>
2009-09-08 12:49 ` [Qemu-devel] [PATCH v2 2/2] virtio-blk: Use bdrv_aio_multiwrite Kevin Wolf
1 sibling, 1 reply; 4+ messages in thread
From: Kevin Wolf @ 2009-09-08 12:49 UTC (permalink / raw)
To: qemu-devel; +Cc: Kevin Wolf
One performance problem of qcow2 during the initial image growth is
sequential writes that are not cluster aligned. In this case, when a first
request requires the allocation of a new cluster but writes only to the first
couple of sectors in that cluster, the rest of the cluster is zeroed - just
to be overwritten by the following second request that fills up the cluster.
Let's try to merge sequential write requests that touch the same cluster, so
we can avoid writing the zero padding to the disk in the first place.
As a nice side effect, other formats also take advantage of dealing with
fewer and larger requests.
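To make the cluster arithmetic concrete, here is a small example assuming 64k
clusters (128 sectors of 512 bytes); the merge condition is the same one
added to block.c below:

    /*
     * reqs[0]: sector 1024, nb_sectors   8  -> first 4k of a new cluster
     * reqs[1]: sector 1032, nb_sectors 120  -> the remaining 60k of it
     *
     * Without merging, allocating the cluster for reqs[0] forces qcow2 to
     * write zeros for sectors 1032..1151, which reqs[1] then immediately
     * overwrites.
     */
    if (reqs[1].sector == reqs[0].sector + reqs[0].nb_sectors) {
        /* One combined 128-sector write covering the whole cluster, so no
         * zero padding hits the disk at all. */
        reqs[0].nb_sectors += reqs[1].nb_sectors;
        reqs[0].qiov = qemu_iovec_concat(reqs[0].qiov, reqs[1].qiov);
    }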
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block.c | 147 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
block.h | 15 ++++++
block_int.h | 3 +
cutils.c | 17 +++++++
qemu-common.h | 1 +
5 files changed, 183 insertions(+), 0 deletions(-)
diff --git a/block.c b/block.c
index 033957d..d9cc257 100644
--- a/block.c
+++ b/block.c
@@ -1354,6 +1354,153 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
return ret;
}
+
+typedef struct MultiwriteCB {
+ int error;
+ int num_requests;
+ int num_callbacks;
+ struct {
+ BlockDriverCompletionFunc *cb;
+ void *opaque;
+ QEMUIOVector *free_qiov;
+ } callbacks[];
+} MultiwriteCB;
+
+static void multiwrite_user_cb(MultiwriteCB *mcb)
+{
+ int i;
+
+ for (i = 0; i < mcb->num_callbacks; i++) {
+ mcb->callbacks[i].cb(mcb->callbacks[i].opaque, mcb->error);
+ qemu_free(mcb->callbacks[i].free_qiov);
+ }
+}
+
+static void multiwrite_cb(void *opaque, int ret)
+{
+ MultiwriteCB *mcb = opaque;
+
+ if (ret < 0) {
+ mcb->error = ret;
+ multiwrite_user_cb(mcb);
+ }
+
+ mcb->num_requests--;
+ if (mcb->num_requests == 0) {
+ if (mcb->error == 0) {
+ multiwrite_user_cb(mcb);
+ }
+ qemu_free(mcb);
+ }
+}
+
+static int multiwrite_req_compare(const void *a, const void *b)
+{
+ return (((BlockRequest*) a)->sector - ((BlockRequest*) b)->sector);
+}
+
+static void multiwrite_merge(BlockDriverState *bs, BlockRequest *reqs,
+ int num_reqs, MultiwriteCB *mcb)
+{
+ int i, outidx;
+
+ // Sort requests by start sector
+ qsort(reqs, num_reqs, sizeof(*reqs), &multiwrite_req_compare);
+
+ // Check if adjacent requests touch the same clusters. If so, combine them,
+ // filling up gaps with zero sectors.
+ outidx = 0;
+ for (i = 1; i < num_reqs; i++) {
+ int merge = 0;
+
+ // This is only handling exactly sequential writes - all block
+ // drivers can benefit from merging in this case. For other
+ // patterns we need to ask the driver.
+ if (reqs[i].sector == reqs[outidx].sector + reqs[outidx].nb_sectors) {
+ merge = 1;
+ }
+
+#if 0
+ if (!merge && drv->bdrv_check_merge) {
+ merge = drv->bdrv_check_merge(bs, &reqs[outidx], &reqs[i]);
+ }
+#endif
+
+ if (merge) {
+ reqs[outidx].nb_sectors += reqs[i].nb_sectors;
+ reqs[outidx].qiov =
+ qemu_iovec_concat(reqs[outidx].qiov, reqs[i].qiov);
+ mcb->callbacks[i].free_qiov = reqs[outidx].qiov;
+ } else {
+ outidx++;
+ reqs[outidx].sector = reqs[i].sector;
+ reqs[outidx].nb_sectors = reqs[i].nb_sectors;
+ reqs[outidx].qiov = reqs[i].qiov;
+ }
+ }
+}
+
+/*
+ * Submit multiple AIO write requests at once.
+ *
+ * On success, the function returns 0 and all requests in the reqs array have
+ * been submitted. In error case this function returns -1, and any of the
+ * requests may or may not be submitted yet. In particular, this means that the
+ * callback will be called for some of the requests, for others it won't. The
+ * caller must check the error field of the BlockRequest to wait for the right
+ * callbacks (if error != 0, no callback will be called).
+ *
+ * The implementation may modify the contents of the reqs array, e.g. to merge
+ * requests. However, the fields opaque and error are left unmodified as they
+ * are used to signal failure for a single request to the caller.
+ */
+int bdrv_aio_multiwrite(BlockDriverState *bs, BlockRequest *reqs, int num_reqs)
+{
+ BlockDriverAIOCB *acb;
+ MultiwriteCB *mcb;
+ int i;
+
+ // Create MultiwriteCB structure
+ mcb = qemu_mallocz(sizeof(*mcb) + num_reqs * sizeof(*mcb->callbacks));
+ mcb->num_requests = 0;
+ mcb->num_callbacks = num_reqs;
+
+ for (i = 0; i < num_reqs; i++) {
+ mcb->callbacks[i].cb = reqs[i].cb;
+ mcb->callbacks[i].opaque = reqs[i].opaque;
+ }
+
+ // Check for mergable requests
+ multiwrite_merge(bs, reqs, num_reqs, mcb);
+
+ // Run the aio requests
+ for (i = 0; i < num_reqs; i++) {
+ acb = bdrv_aio_writev(bs, reqs[i].sector, reqs[i].qiov,
+ reqs[i].nb_sectors, multiwrite_cb, mcb);
+
+ if (acb == NULL) {
+ // We can only fail the whole thing if no request has been
+ // submitted yet. Otherwise we'll wait for the submitted AIOs to
+ // complete and report the error in the callback.
+ if (mcb->num_requests == 0) {
+ reqs[i].error = EIO;
+ goto fail;
+ } else {
+ mcb->error = EIO;
+ break;
+ }
+ } else {
+ mcb->num_requests++;
+ }
+ }
+
+ return 0;
+
+fail:
+ free(mcb);
+ return -1;
+}
+
void bdrv_aio_cancel(BlockDriverAIOCB *acb)
{
acb->pool->cancel(acb);
diff --git a/block.h b/block.h
index 28bf357..ea69052 100644
--- a/block.h
+++ b/block.h
@@ -87,6 +87,21 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
BlockDriverCompletionFunc *cb, void *opaque);
void bdrv_aio_cancel(BlockDriverAIOCB *acb);
+typedef struct BlockRequest {
+ /* Fields to be filled by multiwrite caller */
+ int64_t sector;
+ int nb_sectors;
+ QEMUIOVector *qiov;
+ BlockDriverCompletionFunc *cb;
+ void *opaque;
+
+ /* Filled by multiwrite implementation */
+ int error;
+} BlockRequest;
+
+int bdrv_aio_multiwrite(BlockDriverState *bs, BlockRequest *reqs,
+ int num_reqs);
+
/* sg packet commands */
int bdrv_ioctl(BlockDriverState *bs, unsigned long int req, void *buf);
BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
diff --git a/block_int.h b/block_int.h
index 0902fd4..027b1d8 100644
--- a/block_int.h
+++ b/block_int.h
@@ -70,6 +70,9 @@ struct BlockDriver {
int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
BlockDriverCompletionFunc *cb, void *opaque);
+ int (*bdrv_aio_multiwrite)(BlockDriverState *bs, BlockRequest *reqs,
+ int num_reqs);
+
const char *protocol_name;
int (*bdrv_truncate)(BlockDriverState *bs, int64_t offset);
int64_t (*bdrv_getlength)(BlockDriverState *bs);
diff --git a/cutils.c b/cutils.c
index bd9a019..e12013a 100644
--- a/cutils.c
+++ b/cutils.c
@@ -151,6 +151,23 @@ void qemu_iovec_add(QEMUIOVector *qiov, void *base, size_t len)
++qiov->niov;
}
+QEMUIOVector *qemu_iovec_concat(QEMUIOVector *a, QEMUIOVector *b)
+{
+ int i;
+ QEMUIOVector *qiov = qemu_malloc(sizeof(*qiov));
+
+ qemu_iovec_init(qiov, a->niov + b->niov);
+
+ for (i = 0; i < a->niov; i++) {
+ qemu_iovec_add(qiov, a->iov[i].iov_base, a->iov[i].iov_len);
+ }
+ for (i = 0; i < b->niov; i++) {
+ qemu_iovec_add(qiov, b->iov[i].iov_base, b->iov[i].iov_len);
+ }
+
+ return qiov;
+}
+
void qemu_iovec_destroy(QEMUIOVector *qiov)
{
assert(qiov->nalloc != -1);
diff --git a/qemu-common.h b/qemu-common.h
index 74ac88f..2871820 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -223,6 +223,7 @@ typedef struct QEMUIOVector {
void qemu_iovec_init(QEMUIOVector *qiov, int alloc_hint);
void qemu_iovec_init_external(QEMUIOVector *qiov, struct iovec *iov, int niov);
void qemu_iovec_add(QEMUIOVector *qiov, void *base, size_t len);
+QEMUIOVector *qemu_iovec_concat(QEMUIOVector *a, QEMUIOVector *b);
void qemu_iovec_destroy(QEMUIOVector *qiov);
void qemu_iovec_reset(QEMUIOVector *qiov);
void qemu_iovec_to_buffer(QEMUIOVector *qiov, void *buf);
--
1.6.2.5
* [Qemu-devel] [PATCH v2 2/2] virtio-blk: Use bdrv_aio_multiwrite
2009-09-08 12:49 [Qemu-devel] [PATCH v2 0/2] block: Handle multiple write requests at once Kevin Wolf
2009-09-08 12:49 ` [Qemu-devel] [PATCH v2 1/2] Add bdrv_aio_multiwrite Kevin Wolf
@ 2009-09-08 12:49 ` Kevin Wolf
1 sibling, 0 replies; 4+ messages in thread
From: Kevin Wolf @ 2009-09-08 12:49 UTC (permalink / raw)
To: qemu-devel; +Cc: Kevin Wolf
It is quite common for virtio-blk to submit more than one write request in a
row to the qemu block layer. Use bdrv_aio_multiwrite to allow block drivers
to optimize their handling of these requests.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
hw/virtio-blk.c | 50 ++++++++++++++++++++++++++++++++++++++++++--------
1 files changed, 42 insertions(+), 8 deletions(-)
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index c160246..5c88c12 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -252,15 +252,40 @@ static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
}
#endif /* __linux__ */
-static void virtio_blk_handle_write(VirtIOBlockReq *req)
+static void do_multiwrite(BlockDriverState *bs, BlockRequest *blkreq,
+ int num_writes)
{
- BlockDriverAIOCB *acb;
+ int i, ret;
+ ret = bdrv_aio_multiwrite(bs, blkreq, num_writes);
+
+ if (ret != 0) {
+ for (i = 0; i < num_writes; i++) {
+ if (blkreq[i].error) {
+ virtio_blk_req_complete(blkreq[i].opaque, VIRTIO_BLK_S_IOERR);
+ }
+ }
+ }
+}
- acb = bdrv_aio_writev(req->dev->bs, req->out->sector, &req->qiov,
- req->qiov.size / 512, virtio_blk_rw_complete, req);
- if (!acb) {
- virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
+static void virtio_blk_handle_write(BlockRequest *blkreq, int *num_writes,
+ VirtIOBlockReq *req, BlockDriverState **old_bs)
+{
+ if (req->dev->bs != *old_bs || *num_writes == 32) {
+ if (*old_bs != NULL) {
+ do_multiwrite(*old_bs, blkreq, *num_writes);
+ }
+ *num_writes = 0;
+ *old_bs = req->dev->bs;
}
+
+ blkreq[*num_writes].sector = req->out->sector;
+ blkreq[*num_writes].nb_sectors = req->qiov.size / 512;
+ blkreq[*num_writes].qiov = &req->qiov;
+ blkreq[*num_writes].cb = virtio_blk_rw_complete;
+ blkreq[*num_writes].opaque = req;
+ blkreq[*num_writes].error = 0;
+
+ (*num_writes)++;
}
static void virtio_blk_handle_read(VirtIOBlockReq *req)
@@ -278,6 +303,9 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
{
VirtIOBlock *s = to_virtio_blk(vdev);
VirtIOBlockReq *req;
+ BlockRequest blkreq[32];
+ int num_writes = 0;
+ BlockDriverState *old_bs = NULL;
while ((req = virtio_blk_get_request(s))) {
if (req->elem.out_num < 1 || req->elem.in_num < 1) {
@@ -299,13 +327,18 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
} else if (req->out->type & VIRTIO_BLK_T_OUT) {
qemu_iovec_init_external(&req->qiov, &req->elem.out_sg[1],
req->elem.out_num - 1);
- virtio_blk_handle_write(req);
+ virtio_blk_handle_write(blkreq, &num_writes, req, &old_bs);
} else {
qemu_iovec_init_external(&req->qiov, &req->elem.in_sg[0],
req->elem.in_num - 1);
virtio_blk_handle_read(req);
}
}
+
+ if (num_writes > 0) {
+ do_multiwrite(old_bs, blkreq, num_writes);
+ }
+
/*
* FIXME: Want to check for completions before returning to guest mode,
* so cached reads and writes are reported as quickly as possible. But
@@ -324,7 +357,8 @@ static void virtio_blk_dma_restart_bh(void *opaque)
s->rq = NULL;
while (req) {
- virtio_blk_handle_write(req);
+ bdrv_aio_writev(req->dev->bs, req->out->sector, &req->qiov,
+ req->qiov.size / 512, virtio_blk_rw_complete, req);
req = req->next;
}
}
--
1.6.2.5
* [Qemu-devel] Re: [PATCH v2 1/2] Add bdrv_aio_multiwrite
[not found] ` <m3ws49shch.fsf@neno.mitica>
@ 2009-09-09 10:44 ` Kevin Wolf
0 siblings, 0 replies; 4+ messages in thread
From: Kevin Wolf @ 2009-09-09 10:44 UTC (permalink / raw)
To: Juan Quintela; +Cc: qemu-devel
On 08.09.2009 15:20, Juan Quintela wrote:
> Kevin Wolf <kwolf@redhat.com> wrote:
>> One performance problem of qcow2 during the initial image growth are
>> sequential writes that are not cluster aligned. In this case, when a first
>> requests requires to allocate a new cluster but writes only to the first
>> couple of sectors in that cluster, the rest of the cluster is zeroed - just
>> to be overwritten by the following second request that fills up the cluster.
>>
>> Let's try to merge sequential write requests to the same cluster, so we can
>> avoid to write the zero padding to the disk in the first place.
>>
>> As a nice side effect, also other formats take advantage of dealing with less
>> and larger requests.
>>
>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>> ---
>> block.c | 147 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> block.h | 15 ++++++
>> block_int.h | 3 +
>> cutils.c | 17 +++++++
>> qemu-common.h | 1 +
>> 5 files changed, 183 insertions(+), 0 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index 033957d..d9cc257 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -1354,6 +1354,153 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
>> return ret;
>> }
>>
>> +
>> +typedef struct MultiwriteCB {
>> + int error;
>> + int num_requests;
>> + int num_callbacks;
>> + struct {
>> + BlockDriverCompletionFunc *cb;
>> + void *opaque;
>> + QEMUIOVector *free_qiov;
>> + } callbacks[];
>> +} MultiwriteCB;
>> +
>> +static void multiwrite_user_cb(MultiwriteCB *mcb)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < mcb->num_callbacks; i++) {
>> + mcb->callbacks[i].cb(mcb->callbacks[i].opaque, mcb->error);
>> + qemu_free(mcb->callbacks[i].free_qiov);
>> + }
>> +}
>> +
>> +static void multiwrite_cb(void *opaque, int ret)
>> +{
>> + MultiwriteCB *mcb = opaque;
>> +
>> + if (ret < 0) {
>> + mcb->error = ret;
>> + multiwrite_user_cb(mcb);
>> + }
>> +
>> + mcb->num_requests--;
>> + if (mcb->num_requests == 0) {
>> + if (mcb->error == 0) {
>> + multiwrite_user_cb(mcb);
>> + }
>> + qemu_free(mcb);
>> + }
>> +}
>> +
>> +static int multiwrite_req_compare(const void *a, const void *b)
>> +{
>> + return (((BlockRequest*) a)->sector - ((BlockRequest*) b)->sector);
>> +}
>> +
>> +static void multiwrite_merge(BlockDriverState *bs, BlockRequest *reqs,
>> + int num_reqs, MultiwriteCB *mcb)
>> +{
>> + int i, outidx;
>> +
>> + // Sort requests by start sector
>> + qsort(reqs, num_reqs, sizeof(*reqs), &multiwrite_req_compare);
>> +
>> + // Check if adjacent requests touch the same clusters. If so, combine them,
>> + // filling up gaps with zero sectors.
>> + outidx = 0;
>> + for (i = 1; i < num_reqs; i++) {
>> + int merge = 0;
>> +
>> + // This is only handling exactly sequential writes - all block
>> + // drivers can benefit from merging in this case. For other
>> + // patterns we need to ask the driver.
>> + if (reqs[i].sector == reqs[outidx].sector + reqs[outidx].nb_sectors) {
>> + merge = 1;
>
> shouldn't we handle overlapped writes? Didn't look too difficult once
> you have this infrastructure.
We could handle overlapping writes here in generic code, right. I'd
prefer to start with something simple and add such extensions later. I'm
not even sure if there is any point in optimizing overlapping write
requests - there is little reason for a guest to submit them in the
first place.
What happens more frequently is that two requests touch the same
cluster, but some sectors between the end of the first request and the
start of the second request are left untouched. As long as the cluster
is unallocated, we can optimize this case in a format-specific way by
inserting zero padding...
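A rough sketch of what such a format-specific hook could look like - purely
hypothetical, cluster_is_unallocated() and the cluster_sectors field stand in
for the real qcow2 metadata lookups:

    static int qcow2_check_merge(BlockDriverState *bs,
        BlockRequest *a, BlockRequest *b)
    {
        BDRVQcowState *s = bs->opaque;
        int64_t cluster = (a->sector + a->nb_sectors - 1) / s->cluster_sectors;

        /* Only consider a merge if the second request starts in the cluster
         * in which the first one ends... */
        if (b->sector / s->cluster_sectors != cluster) {
            return 0;
        }

        /* ...and only if that cluster is still unallocated, so any gap
         * between the two requests would have to be zero-filled during
         * allocation anyway. */
        return cluster_is_unallocated(bs, cluster);
    }

Filling the actual gap with a zero iovec would of course still have to happen
in multiwrite_merge().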
>> + }
>> +
>> +#if 0
>> + if (!merge && drv->bdrv_check_merge) {
>> + merge = drv->bdrv_check_merge(bs, &reqs[outidx], &reqs[i]);
>> + }
>> +#endif
>
> This can/should be removed :)
...and bringing this code to life. :-)
To put what I said above into numbers, here are some figures from a RHEL
installation:
Possible merges: 45830
Merged: 18141
Gap between requests: 27689
Overlapping: 0
>> +
>> + if (merge) {
>> + reqs[outidx].nb_sectors += reqs[i].nb_sectors;
>
> This don't handle overlapping writes. Is there any test to be sure that
> we are not having overlapping writes? Otherwise, adding support should
> be trivial.
Currently overlapping writes are never merged, so let's leave this for a
patch on top that actually implements this functionality.
Kevin