qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [PATCH v3 0/2] block: Handle multiple write requests at once
@ 2009-09-09 15:53 Kevin Wolf
  2009-09-09 15:53 ` [Qemu-devel] [PATCH v3 1/2] Add bdrv_aio_multiwrite Kevin Wolf
  2009-09-09 15:53 ` [Qemu-devel] [PATCH v3 2/2] virtio-blk: Use bdrv_aio_multiwrite Kevin Wolf
  0 siblings, 2 replies; 10+ messages in thread
From: Kevin Wolf @ 2009-09-09 15:53 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf

virtio often issues multiple write requests in a row, but submits each one
independently. If the block drivers were aware of all of these requests at
once, they could optimize how they handle them. See the description of patch 1
for how qcow2 can use this to avoid unnecessary writes to the disk.
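
For illustration, the caller side then looks roughly like this (a simplified
sketch, not actual code from the patches; bs, first_sector, qiov1, req1 and
write_done stand in for whatever the device model already has at hand):

    BlockRequest reqs[2];

    reqs[0].sector     = first_sector;
    reqs[0].nb_sectors = qiov1->size / 512;
    reqs[0].qiov       = qiov1;
    reqs[0].cb         = write_done;    /* per-request completion callback */
    reqs[0].opaque     = req1;
    reqs[0].error      = 0;
    /* ... fill in reqs[1] the same way for the second request ... */

    if (bdrv_aio_multiwrite(bs, reqs, 2) < 0) {
        /* requests whose error field is set won't get their callback called */
    }

Instead of two independent bdrv_aio_writev calls, the block layer now sees both
requests at once and may merge them before they are submitted to the disk.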

v2:
The simple request merging code isn't qcow2-specific at all. Enable it for all
formats. A cleverer merging policy can still be implemented later in a
driver-specific way.

v3:
Handle overlapping requests, add a hook for block drivers to merge even more
requests.

Kevin Wolf (2):
  Add bdrv_aio_multiwrite
  virtio-blk: Use bdrv_aio_multiwrite

 block.c         |  183 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 block.h         |   15 +++++
 block_int.h     |    6 ++
 cutils.c        |   25 ++++++++
 hw/virtio-blk.c |   50 +++++++++++++---
 qemu-common.h   |    1 +
 6 files changed, 272 insertions(+), 8 deletions(-)


* [Qemu-devel] [PATCH v3 1/2] Add bdrv_aio_multiwrite
  2009-09-09 15:53 [Qemu-devel] [PATCH v3 0/2] block: Handle multiple write requests at once Kevin Wolf
@ 2009-09-09 15:53 ` Kevin Wolf
       [not found]   ` <m3ocpj67ip.fsf@neno.mitica>
  2009-09-10 22:44   ` [Qemu-devel] " Christoph Hellwig
  2009-09-09 15:53 ` [Qemu-devel] [PATCH v3 2/2] virtio-blk: Use bdrv_aio_multiwrite Kevin Wolf
  1 sibling, 2 replies; 10+ messages in thread
From: Kevin Wolf @ 2009-09-09 15:53 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf

One performance problem of qcow2 during initial image growth is sequential
writes that are not cluster aligned. In this case, when a first request needs
to allocate a new cluster but writes only to the first couple of sectors in
that cluster, the rest of the cluster is zeroed - just to be overwritten by the
following second request that fills up the cluster.

Let's try to merge sequential write requests to the same cluster, so we can
avoid writing the zero padding to the disk in the first place.
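
To make this concrete, consider 64k clusters (128 sectors) and two sequential
writes into a newly allocated cluster (made-up numbers, not from a real trace):

    request A: sector 0, nb_sectors 8    -> allocates the cluster; qcow2 pads
                                            sectors 8..127 with 60k of zeros
    request B: sector 8, nb_sectors 120  -> overwrites exactly those zeros

Merged, A and B become a single 128-sector write and no padding is written at
all.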

As a nice side effect, other formats also benefit from dealing with fewer and
larger requests.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block.c       |  183 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 block.h       |   15 +++++
 block_int.h   |    6 ++
 cutils.c      |   25 ++++++++
 qemu-common.h |    1 +
 5 files changed, 230 insertions(+), 0 deletions(-)

diff --git a/block.c b/block.c
index 033957d..45ad6fb 100644
--- a/block.c
+++ b/block.c
@@ -1354,6 +1354,189 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
     return ret;
 }
 
+
+typedef struct MultiwriteCB {
+    int error;
+    int num_requests;
+    int num_callbacks;
+    struct {
+        BlockDriverCompletionFunc *cb;
+        void *opaque;
+        QEMUIOVector *free_qiov;
+        void *free_buf;
+    } callbacks[];
+} MultiwriteCB;
+
+static void multiwrite_user_cb(MultiwriteCB *mcb)
+{
+    int i;
+
+    for (i = 0; i < mcb->num_callbacks; i++) {
+        mcb->callbacks[i].cb(mcb->callbacks[i].opaque, mcb->error);
+        qemu_free(mcb->callbacks[i].free_qiov);
+        qemu_free(mcb->callbacks[i].free_buf);
+    }
+}
+
+static void multiwrite_cb(void *opaque, int ret)
+{
+    MultiwriteCB *mcb = opaque;
+
+    if (ret < 0) {
+        mcb->error = ret;
+        multiwrite_user_cb(mcb);
+    }
+
+    mcb->num_requests--;
+    if (mcb->num_requests == 0) {
+        if (mcb->error == 0) {
+            multiwrite_user_cb(mcb);
+        }
+        qemu_free(mcb);
+    }
+}
+
+static int multiwrite_req_compare(const void *a, const void *b)
+{
+    return (((BlockRequest*) a)->sector - ((BlockRequest*) b)->sector);
+}
+
+/*
+ * Takes a bunch of requests and tries to merge them. Returns the number of
+ * requests that remain after merging.
+ */
+static int multiwrite_merge(BlockDriverState *bs, BlockRequest *reqs,
+    int num_reqs, MultiwriteCB *mcb)
+{
+    int i, outidx;
+
+    // Sort requests by start sector
+    qsort(reqs, num_reqs, sizeof(*reqs), &multiwrite_req_compare);
+
+    // Check if adjacent requests touch the same clusters. If so, combine them,
+    // filling up gaps with zero sectors.
+    outidx = 0;
+    for (i = 1; i < num_reqs; i++) {
+        int merge = 0;
+        int64_t oldreq_last = reqs[outidx].sector + reqs[outidx].nb_sectors;
+
+        // This handles the cases that are valid for all block drivers, namely
+        // exactly sequential writes and overlapping writes.
+        if (reqs[i].sector <= oldreq_last) {
+            merge = 1;
+        }
+
+        // The block driver may decide that it makes sense to combine requests
+        // even if there is a gap of some sectors between them. In this case,
+        // the gap is filled with zeros (therefore only applicable for yet
+        // unused space in format like qcow2).
+        if (!merge && bs->drv->bdrv_merge_requests) {
+            merge = bs->drv->bdrv_merge_requests(bs, &reqs[outidx], &reqs[i]);
+        }
+
+        if (merge) {
+            size_t size;
+            QEMUIOVector *qiov = qemu_mallocz(sizeof(*qiov));
+            qemu_iovec_init(qiov,
+                reqs[outidx].qiov->niov + reqs[i].qiov->niov + 1);
+
+            // Add the first request to the merged one. If the requests are
+            // overlapping, drop the last sectors of the first request.
+            size = (reqs[i].sector - reqs[outidx].sector) << 9;
+            qemu_iovec_concat(qiov, reqs[outidx].qiov, size);
+
+            // We might need to add some zeros between the two requests
+            if (reqs[i].sector > oldreq_last) {
+                size_t zero_bytes = (reqs[i].sector - oldreq_last) << 9;
+                uint8_t *buf = qemu_blockalign(bs, zero_bytes);
+                memset(buf, 0, zero_bytes);
+                qemu_iovec_add(qiov, buf, zero_bytes);
+                mcb->callbacks[i].free_buf = buf;
+            }
+
+            // Add the second request
+            qemu_iovec_concat(qiov, reqs[i].qiov, reqs[i].qiov->size);
+
+            reqs[outidx].nb_sectors += reqs[i].nb_sectors;
+            reqs[outidx].qiov = qiov;
+
+            mcb->callbacks[i].free_qiov = reqs[outidx].qiov;
+        } else {
+            outidx++;
+            reqs[outidx].sector     = reqs[i].sector;
+            reqs[outidx].nb_sectors = reqs[i].nb_sectors;
+            reqs[outidx].qiov       = reqs[i].qiov;
+        }
+    }
+
+    return outidx + 1;
+}
+
+/*
+ * Submit multiple AIO write requests at once.
+ *
+ * On success, the function returns 0 and all requests in the reqs array have
+ * been submitted. In error case this function returns -1, and any of the
+ * requests may or may not be submitted yet. In particular, this means that the
+ * callback will be called for some of the requests, for others it won't. The
+ * caller must check the error field of the BlockRequest to wait for the right
+ * callbacks (if error != 0, no callback will be called).
+ *
+ * The implementation may modify the contents of the reqs array, e.g. to merge
+ * requests. However, the fields opaque and error are left unmodified as they
+ * are used to signal failure for a single request to the caller.
+ */
+int bdrv_aio_multiwrite(BlockDriverState *bs, BlockRequest *reqs, int num_reqs)
+{
+    BlockDriverAIOCB *acb;
+    MultiwriteCB *mcb;
+    int i;
+
+    if (num_reqs == 0) {
+        return 0;
+    }
+
+    // Create MultiwriteCB structure
+    mcb = qemu_mallocz(sizeof(*mcb) + num_reqs * sizeof(*mcb->callbacks));
+    mcb->num_requests = 0;
+    mcb->num_callbacks = num_reqs;
+
+    for (i = 0; i < num_reqs; i++) {
+        mcb->callbacks[i].cb = reqs[i].cb;
+        mcb->callbacks[i].opaque = reqs[i].opaque;
+    }
+
+    // Check for mergable requests
+    num_reqs = multiwrite_merge(bs, reqs, num_reqs, mcb);
+
+    // Run the aio requests
+    for (i = 0; i < num_reqs; i++) {
+        acb = bdrv_aio_writev(bs, reqs[i].sector, reqs[i].qiov,
+            reqs[i].nb_sectors, multiwrite_cb, mcb);
+
+        if (acb == NULL) {
+            // We can only fail the whole thing if no request has been
+            // submitted yet. Otherwise we'll wait for the submitted AIOs to
+            // complete and report the error in the callback.
+            if (mcb->num_requests == 0) {
+                reqs[i].error = EIO;
+                goto fail;
+            } else {
+                mcb->error = EIO;
+                break;
+            }
+        } else {
+            mcb->num_requests++;
+        }
+    }
+
+    return 0;
+
+fail:
+    free(mcb);
+    return -1;
+}
+
 void bdrv_aio_cancel(BlockDriverAIOCB *acb)
 {
     acb->pool->cancel(acb);
diff --git a/block.h b/block.h
index 28bf357..ea69052 100644
--- a/block.h
+++ b/block.h
@@ -87,6 +87,21 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
                                   BlockDriverCompletionFunc *cb, void *opaque);
 void bdrv_aio_cancel(BlockDriverAIOCB *acb);
 
+typedef struct BlockRequest {
+    /* Fields to be filled by multiwrite caller */
+    int64_t sector;
+    int nb_sectors;
+    QEMUIOVector *qiov;
+    BlockDriverCompletionFunc *cb;
+    void *opaque;
+
+    /* Filled by multiwrite implementation */
+    int error;
+} BlockRequest;
+
+int bdrv_aio_multiwrite(BlockDriverState *bs, BlockRequest *reqs,
+    int num_reqs);
+
 /* sg packet commands */
 int bdrv_ioctl(BlockDriverState *bs, unsigned long int req, void *buf);
 BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
diff --git a/block_int.h b/block_int.h
index 0902fd4..3e4b4a3 100644
--- a/block_int.h
+++ b/block_int.h
@@ -70,6 +70,12 @@ struct BlockDriver {
         int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
         BlockDriverCompletionFunc *cb, void *opaque);
 
+    int (*bdrv_aio_multiwrite)(BlockDriverState *bs, BlockRequest *reqs,
+        int num_reqs);
+    int (*bdrv_merge_requests)(BlockDriverState *bs, BlockRequest* a,
+        BlockRequest *b);
+
+
     const char *protocol_name;
     int (*bdrv_truncate)(BlockDriverState *bs, int64_t offset);
     int64_t (*bdrv_getlength)(BlockDriverState *bs);
diff --git a/cutils.c b/cutils.c
index bd9a019..ffe5c71 100644
--- a/cutils.c
+++ b/cutils.c
@@ -151,6 +151,31 @@ void qemu_iovec_add(QEMUIOVector *qiov, void *base, size_t len)
     ++qiov->niov;
 }
 
+/*
+ * Copies iovecs from src to the end dst until src is completely copied or the
+ * total size of the copied iovec reaches size. The size of the last copied
+ * iovec is changed in order to fit the specified total size if it isn't a
+ * perfect fit already.
+ */
+void qemu_iovec_concat(QEMUIOVector *dst, QEMUIOVector *src, size_t size)
+{
+    int i;
+    size_t done;
+
+    assert(dst->nalloc != -1);
+
+    done = 0;
+    for (i = 0; (i < src->niov) && (done != size); i++) {
+        if (done + src->iov[i].iov_len > size) {
+            qemu_iovec_add(dst, src->iov[i].iov_base, size - done);
+            break;
+        } else {
+            qemu_iovec_add(dst, src->iov[i].iov_base, src->iov[i].iov_len);
+        }
+        done += src->iov[i].iov_len;
+    }
+}
+
 void qemu_iovec_destroy(QEMUIOVector *qiov)
 {
     assert(qiov->nalloc != -1);
diff --git a/qemu-common.h b/qemu-common.h
index 74ac88f..f3cfb68 100644
--- a/qemu-common.h
+++ b/qemu-common.h
@@ -223,6 +223,7 @@ typedef struct QEMUIOVector {
 void qemu_iovec_init(QEMUIOVector *qiov, int alloc_hint);
 void qemu_iovec_init_external(QEMUIOVector *qiov, struct iovec *iov, int niov);
 void qemu_iovec_add(QEMUIOVector *qiov, void *base, size_t len);
+void qemu_iovec_concat(QEMUIOVector *dst, QEMUIOVector *src, size_t size);
 void qemu_iovec_destroy(QEMUIOVector *qiov);
 void qemu_iovec_reset(QEMUIOVector *qiov);
 void qemu_iovec_to_buffer(QEMUIOVector *qiov, void *buf);
-- 
1.6.2.5


* [Qemu-devel] [PATCH v3 2/2] virtio-blk: Use bdrv_aio_multiwrite
  2009-09-09 15:53 [Qemu-devel] [PATCH v3 0/2] block: Handle multiple write requests at once Kevin Wolf
  2009-09-09 15:53 ` [Qemu-devel] [PATCH v3 1/2] Add bdrv_aio_multiwrite Kevin Wolf
@ 2009-09-09 15:53 ` Kevin Wolf
  2009-09-10 22:41   ` Christoph Hellwig
  1 sibling, 1 reply; 10+ messages in thread
From: Kevin Wolf @ 2009-09-09 15:53 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf

It is quite common for virtio-blk to submit more than one write request in a
row to the qemu block layer. Use bdrv_aio_multiwrite to allow block drivers to
optimize their handling of these requests.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 hw/virtio-blk.c |   50 ++++++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index c160246..5c88c12 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -252,15 +252,40 @@ static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
 }
 #endif /* __linux__ */
 
-static void virtio_blk_handle_write(VirtIOBlockReq *req)
+static void do_multiwrite(BlockDriverState *bs, BlockRequest *blkreq,
+    int num_writes)
 {
-    BlockDriverAIOCB *acb;
+    int i, ret;
+    ret = bdrv_aio_multiwrite(bs, blkreq, num_writes);
+
+    if (ret != 0) {
+        for (i = 0; i < num_writes; i++) {
+            if (blkreq[i].error) {
+                virtio_blk_req_complete(blkreq[i].opaque, VIRTIO_BLK_S_IOERR);
+            }
+        }
+    }
+}
 
-    acb = bdrv_aio_writev(req->dev->bs, req->out->sector, &req->qiov,
-                          req->qiov.size / 512, virtio_blk_rw_complete, req);
-    if (!acb) {
-        virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
+static void virtio_blk_handle_write(BlockRequest *blkreq, int *num_writes,
+    VirtIOBlockReq *req, BlockDriverState **old_bs)
+{
+    if (req->dev->bs != *old_bs || *num_writes == 32) {
+        if (*old_bs != NULL) {
+            do_multiwrite(*old_bs, blkreq, *num_writes);
+        }
+        *num_writes = 0;
+        *old_bs = req->dev->bs;
     }
+
+    blkreq[*num_writes].sector = req->out->sector;
+    blkreq[*num_writes].nb_sectors = req->qiov.size / 512;
+    blkreq[*num_writes].qiov = &req->qiov;
+    blkreq[*num_writes].cb = virtio_blk_rw_complete;
+    blkreq[*num_writes].opaque = req;
+    blkreq[*num_writes].error = 0;
+
+    (*num_writes)++;
 }
 
 static void virtio_blk_handle_read(VirtIOBlockReq *req)
@@ -278,6 +303,9 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
 {
     VirtIOBlock *s = to_virtio_blk(vdev);
     VirtIOBlockReq *req;
+    BlockRequest blkreq[32];
+    int num_writes = 0;
+    BlockDriverState *old_bs = NULL;
 
     while ((req = virtio_blk_get_request(s))) {
         if (req->elem.out_num < 1 || req->elem.in_num < 1) {
@@ -299,13 +327,18 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
         } else if (req->out->type & VIRTIO_BLK_T_OUT) {
             qemu_iovec_init_external(&req->qiov, &req->elem.out_sg[1],
                                      req->elem.out_num - 1);
-            virtio_blk_handle_write(req);
+            virtio_blk_handle_write(blkreq, &num_writes, req, &old_bs);
         } else {
             qemu_iovec_init_external(&req->qiov, &req->elem.in_sg[0],
                                      req->elem.in_num - 1);
             virtio_blk_handle_read(req);
         }
     }
+
+    if (num_writes > 0) {
+        do_multiwrite(old_bs, blkreq, num_writes);
+    }
+
     /*
      * FIXME: Want to check for completions before returning to guest mode,
      * so cached reads and writes are reported as quickly as possible. But
@@ -324,7 +357,8 @@ static void virtio_blk_dma_restart_bh(void *opaque)
     s->rq = NULL;
 
     while (req) {
-        virtio_blk_handle_write(req);
+        bdrv_aio_writev(req->dev->bs, req->out->sector, &req->qiov,
+            req->qiov.size / 512, virtio_blk_rw_complete, req);
         req = req->next;
     }
 }
-- 
1.6.2.5


* [Qemu-devel] Re: [PATCH v3 1/2] Add bdrv_aio_multiwrite
       [not found]   ` <m3ocpj67ip.fsf@neno.mitica>
@ 2009-09-10  7:07     ` Kevin Wolf
  0 siblings, 0 replies; 10+ messages in thread
From: Kevin Wolf @ 2009-09-10  7:07 UTC (permalink / raw)
  To: Juan Quintela; +Cc: qemu-devel

On 10.09.2009 01:08, Juan Quintela wrote:
> Kevin Wolf <kwolf@redhat.com> wrote:
> 
> Hi again
> 
> 
>> +            QEMUIOVector *qiov = qemu_mallocz(sizeof(*qiov));
>> +            qemu_iovec_init(qiov,
>> +                reqs[outidx].qiov->niov + reqs[i].qiov->niov + 1);
>> +
>> +            // Add the first request to the merged one. If the requests are
>> +            // overlapping, drop the last sectors of the first request.
>> +            size = (reqs[i].sector - reqs[outidx].sector) << 9;
>> +            qemu_iovec_concat(qiov, reqs[outidx].qiov, size);
>> +
>> +            // We might need to add some zeros between the two requests
>> +            if (reqs[i].sector > oldreq_last) {
>> +                size_t zero_bytes = (reqs[i].sector - oldreq_last) << 9;
>> +                uint8_t *buf = qemu_blockalign(bs, zero_bytes);
>> +                memset(buf, 0, zero_bytes);
>> +                qemu_iovec_add(qiov, buf, zero_bytes);
>> +                mcb->callbacks[i].free_buf = buf;
>> +            }
>> +
>> +            // Add the second request
>> +            qemu_iovec_concat(qiov, reqs[i].qiov, reqs[i].qiov->size);
>> +
>> +            reqs[outidx].nb_sectors += reqs[i].nb_sectors;
>> +            reqs[outidx].qiov = qiov;
> 
> What frees reqs[outidx].qiov's previous value, or the new one for that matter?

Your decision where to stop quoting the code is kind of funny. The very
next line is:

mcb->callbacks[i].free_qiov = reqs[outidx].qiov;

This saves each newly allocated qiov. It is later freed in
multiwrite_user_cb. Freeing the original qiov is the caller's
responsibility.

> I can't see where it is done.  As far as I can see, we are losing both
> the ones that are overwritten and the ones for the requests that got merged.

Hope you see them now. :-)

Kevin


* Re: [Qemu-devel] [PATCH v3 2/2] virtio-blk: Use bdrv_aio_multiwrite
  2009-09-09 15:53 ` [Qemu-devel] [PATCH v3 2/2] virtio-blk: Use bdrv_aio_multiwrite Kevin Wolf
@ 2009-09-10 22:41   ` Christoph Hellwig
  2009-09-11  7:10     ` Kevin Wolf
  0 siblings, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2009-09-10 22:41 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel

On Wed, Sep 09, 2009 at 05:53:38PM +0200, Kevin Wolf wrote:
> +static void do_multiwrite(BlockDriverState *bs, BlockRequest *blkreq,
> +    int num_writes)
>  {
> -    BlockDriverAIOCB *acb;
> +    int i, ret;
> +    ret = bdrv_aio_multiwrite(bs, blkreq, num_writes);
> +
> +    if (ret != 0) {
> +        for (i = 0; i < num_writes; i++) {
> +            if (blkreq[i].error) {
> +                virtio_blk_req_complete(blkreq[i].opaque, VIRTIO_BLK_S_IOERR);
> +            }
> +        }
> +    }
> +}
>  
> +static void virtio_blk_handle_write(BlockRequest *blkreq, int *num_writes,
> +    VirtIOBlockReq *req, BlockDriverState **old_bs)
> +{
> +    if (req->dev->bs != *old_bs || *num_writes == 32) {

I was under the impression we had one queue per device, so the first
condition should never happen.

> +        if (*old_bs != NULL) {
> +            do_multiwrite(*old_bs, blkreq, *num_writes);
> +        }
> +        *num_writes = 0;
> +        *old_bs = req->dev->bs;
>      }
> +
> +    blkreq[*num_writes].sector = req->out->sector;
> +    blkreq[*num_writes].nb_sectors = req->qiov.size / 512;
> +    blkreq[*num_writes].qiov = &req->qiov;
> +    blkreq[*num_writes].cb = virtio_blk_rw_complete;
> +    blkreq[*num_writes].opaque = req;
> +    blkreq[*num_writes].error = 0;
> +
> +    (*num_writes)++;

If you pass the completion routine to the function and map the error case
to calling the completion routine (which is the usual way to handle errors
anyway) this function could become completely generic.

I think we also need to actually store the iocb in the BlockRequest;
ide and scsi use it to cancel I/O on migration (why virtio
doesn't is on my TODO list to investigate) and in some other cases.

Another improvement to the data structure would be to have a container
structure containing the BlockRequest array and num_writes, at which
point this actually becomes a pretty clean abstraction, which we
could also use to submit multiple iocbs in the native AIO code.
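
Roughly something like this (all names made up, just to illustrate the idea):

    typedef struct BlockRequestBatch {
        BlockRequest reqs[32];     /* BlockRequest as introduced in patch 1 */
        int num_reqs;
    } BlockRequestBatch;

with BlockRequest additionally growing a

    BlockDriverAIOCB *acb;         /* filled in on submission, for bdrv_aio_cancel */

field so that the ide/scsi style cancellation can keep working.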

Any chance to just use this batched submission unconditionally and
also for reads?  I'd hate to grow even more confusing I/O methods
in the block layer.


* Re: [Qemu-devel] [PATCH v3 1/2] Add bdrv_aio_multiwrite
  2009-09-09 15:53 ` [Qemu-devel] [PATCH v3 1/2] Add bdrv_aio_multiwrite Kevin Wolf
       [not found]   ` <m3ocpj67ip.fsf@neno.mitica>
@ 2009-09-10 22:44   ` Christoph Hellwig
  2009-09-11  6:29     ` Kevin Wolf
  1 sibling, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2009-09-10 22:44 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel

One thing that concerns me here is that we keep adding more memory
allocations to the I/O path.  At least on fast SSDs even kernel memory
allocations are a performance problem and they're much faster than
userspace ones.  


* Re: [Qemu-devel] [PATCH v3 1/2] Add bdrv_aio_multiwrite
  2009-09-10 22:44   ` [Qemu-devel] " Christoph Hellwig
@ 2009-09-11  6:29     ` Kevin Wolf
  2009-09-11 18:36       ` Christoph Hellwig
  0 siblings, 1 reply; 10+ messages in thread
From: Kevin Wolf @ 2009-09-11  6:29 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: qemu-devel

On 11.09.2009 00:44, Christoph Hellwig wrote:
> One thing that concerns me here is that we keep adding more memory
> allocations to the I/O path.  At least on fast SSDs even kernel memory
> allocations are a performance problem and they're much faster than
> userspace ones.  

In the non-merging case we have one additional allocation, the one for
mcb. Maybe we could do something like the AIOPool here to avoid some
mallocs.

In the merging case we need to allocate new IO vectors. I don't see a
way around this, though, and I hope that the benefit of a saved write
request outweighs the malloc costs. If not, it probably was a bad idea
to accept your suggestion to make it generic code instead of just
merging requests for qcow2.

Kevin


* Re: [Qemu-devel] [PATCH v3 2/2] virtio-blk: Use bdrv_aio_multiwrite
  2009-09-10 22:41   ` Christoph Hellwig
@ 2009-09-11  7:10     ` Kevin Wolf
  2009-09-11 18:39       ` Christoph Hellwig
  0 siblings, 1 reply; 10+ messages in thread
From: Kevin Wolf @ 2009-09-11  7:10 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: qemu-devel

On 11.09.2009 00:41, Christoph Hellwig wrote:
> On Wed, Sep 09, 2009 at 05:53:38PM +0200, Kevin Wolf wrote:
>> +static void do_multiwrite(BlockDriverState *bs, BlockRequest *blkreq,
>> +    int num_writes)
>>  {
>> -    BlockDriverAIOCB *acb;
>> +    int i, ret;
>> +    ret = bdrv_aio_multiwrite(bs, blkreq, num_writes);
>> +
>> +    if (ret != 0) {
>> +        for (i = 0; i < num_writes; i++) {
>> +            if (blkreq[i].error) {
>> +                virtio_blk_req_complete(blkreq[i].opaque, VIRTIO_BLK_S_IOERR);
>> +            }
>> +        }
>> +    }
>> +}
>>  
>> +static void virtio_blk_handle_write(BlockRequest *blkreq, int *num_writes,
>> +    VirtIOBlockReq *req, BlockDriverState **old_bs)
>> +{
>> +    if (req->dev->bs != *old_bs || *num_writes == 32) {
> 
> I was under the impression we had one queue per device, so the first
> condition should never happen.

Maybe, don't know? ;-) Makes perfect sense, so probably you're right. I
was just trying to be better safe than sorry. I can take it out if you
prefer.

>> +        if (*old_bs != NULL) {
>> +            do_multiwrite(*old_bs, blkreq, *num_writes);
>> +        }
>> +        *num_writes = 0;
>> +        *old_bs = req->dev->bs;
>>      }
>> +
>> +    blkreq[*num_writes].sector = req->out->sector;
>> +    blkreq[*num_writes].nb_sectors = req->qiov.size / 512;
>> +    blkreq[*num_writes].qiov = &req->qiov;
>> +    blkreq[*num_writes].cb = virtio_blk_rw_complete;
>> +    blkreq[*num_writes].opaque = req;
>> +    blkreq[*num_writes].error = 0;
>> +
>> +    (*num_writes)++;
> 
> If you pass the completion routine to the function and map the error case
> to calling the completion routine (which is the usual way to handle errors
> anyway) this function could become completely generic.

Except that VirtIOBlockReq doesn't seem to be a type commonly used in
generic code.

> I think we also need to actually store the iocb in the BlockRequest
> ide and scsi use it to cancel I/O on migration (why virtio
> doesn't is on my TODO list to investigate) or some other cases.

Right, this is something that is still missing. Considering that virtio
as the only multiwrite user doesn't use it (yet) anyway and that it's
not completely trivial (the request to be cancelled could have been
merged), I would prefer to introduce this in a patch on top. The
bdrv_aio_multiwrite patch is already larger than I would like it to be.

> Another improvement to the data structure would be to have a container
> structure containing the BlockRequest array and num_writes, at which
> point this actually becomes a pretty clean abstraction, which we
> could also use to submit multiple iocbs in the native AIO code.

Ok, I'm not opposed to this.

> Any chance to just use this batched submission unconditionally and
> also for reads?  I'd hate to grow even more confusing I/O methods
> in the block layer.

If we want to completely obsolete bdrv_aio_readv/writev by batch
submission functions (not only in block.c but also in each block
driver), we certainly can do that. I think this would make a lot of
sense, but it's quite some work and definitely out of scope for this
patch which is basically meant to be a qcow2 performance fix.

Kevin


* Re: [Qemu-devel] [PATCH v3 1/2] Add bdrv_aio_multiwrite
  2009-09-11  6:29     ` Kevin Wolf
@ 2009-09-11 18:36       ` Christoph Hellwig
  0 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2009-09-11 18:36 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Christoph Hellwig, qemu-devel

On Fri, Sep 11, 2009 at 08:29:09AM +0200, Kevin Wolf wrote:
> On 11.09.2009 00:44, Christoph Hellwig wrote:
> > One thing that concerns me here is that we keep adding more memory
> > allocations to the I/O path.  At least on fast SSDs even kernel memory
> > allocations are a performance problem and they're much faster than
> > userspace ones.  
> 
> In the non-merging case we have one additional allocation, the one for
> mcb. Maybe we could do something like the AIOPool here to avoid some
> mallocs.
> 
> In the merging case we need to allocate new IO vectors. I don't see a
> way around this, though, and I hope that the benefit of a saved write
> request outweighs the malloc costs. If not, it probably was a bad idea
> to accept your suggestion to make it generic code instead of just
> merging requests for qcow2.

Right now I'm just thinking about the potential issues.  The problem
is that the trade-offs may be very different for different scenarios.
For a normal disk, which is by far the most common option (and will be
for a long time), the merging is a clear benefit.  For SSDs we might have
to do different optimizations.  If it's not a big issue, avoiding too
many memory allocations and rather doing fewer, larger ones is in general
a good thing, though.


* Re: [Qemu-devel] [PATCH v3 2/2] virtio-blk: Use bdrv_aio_multiwrite
  2009-09-11  7:10     ` Kevin Wolf
@ 2009-09-11 18:39       ` Christoph Hellwig
  0 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2009-09-11 18:39 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Christoph Hellwig, qemu-devel

On Fri, Sep 11, 2009 at 09:10:20AM +0200, Kevin Wolf wrote:
> >> +    blkreq[*num_writes].sector = req->out->sector;
> >> +    blkreq[*num_writes].nb_sectors = req->qiov.size / 512;
> >> +    blkreq[*num_writes].qiov = &req->qiov;
> >> +    blkreq[*num_writes].cb = virtio_blk_rw_complete;
> >> +    blkreq[*num_writes].opaque = req;
> >> +    blkreq[*num_writes].error = 0;
> >> +
> >> +    (*num_writes)++;
> > 
> > If you pass the completion routine to the function and map the error case
> > to calling completion routine (which is the usual way to handle errors
> > anyway) this function could become copletely generic.
> 
> Except that VirtIOBlockReq doesn't seem to be a type commonly used in
> generic code.

Yeah, we'd need to pass it only as an opaque cookie and the qiov/sector
separately, making the whole thing look more similar to how the block
API works elsewhere.
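
Roughly something like this (made-up name and signature, just to illustrate):

    static void block_queue_write(BlockRequest *blkreq, int *num_writes,
        int64_t sector, QEMUIOVector *qiov,
        BlockDriverCompletionFunc *cb, void *opaque)
    {
        BlockRequest *r = &blkreq[(*num_writes)++];

        r->sector     = sector;
        r->nb_sectors = qiov->size / 512;
        r->qiov       = qiov;
        r->cb         = cb;
        r->opaque     = opaque;    /* caller's cookie, e.g. the VirtIOBlockReq */
        r->error      = 0;
    }

so nothing virtio-specific is left in it and it could live in generic code.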

> > Any chance to just use this batched submission unconditionally and
> > also for reads?  I'd hate to grow even more confusing I/O methods
> > in the block layer.
> 
> If we want to completely obsolete bdrv_aio_readv/writev by batch
> submission functions (not only in block.c but also in each block
> driver), we certainly can do that. I think this would make a lot of
> sense, but it's quite some work and definitely out of scope for this
> patch which is basically meant to be a qcow2 performance fix.

I'm generally not a big fan of incomplete transitions; history tells us
they will remain incomplete for a long time or even forever and grow
more and more of the old calls.  The persistent existence of the non-AIO
block APIs in qemu is one of those cases.


end of thread (newest message: 2009-09-11 18:39 UTC)

Thread overview: 10+ messages
2009-09-09 15:53 [Qemu-devel] [PATCH v3 0/2] block: Handle multiple write requests at once Kevin Wolf
2009-09-09 15:53 ` [Qemu-devel] [PATCH v3 1/2] Add bdrv_aio_multiwrite Kevin Wolf
     [not found]   ` <m3ocpj67ip.fsf@neno.mitica>
2009-09-10  7:07     ` [Qemu-devel] " Kevin Wolf
2009-09-10 22:44   ` [Qemu-devel] " Christoph Hellwig
2009-09-11  6:29     ` Kevin Wolf
2009-09-11 18:36       ` Christoph Hellwig
2009-09-09 15:53 ` [Qemu-devel] [PATCH v3 2/2] virtio-blk: Use bdrv_aio_multiwrite Kevin Wolf
2009-09-10 22:41   ` Christoph Hellwig
2009-09-11  7:10     ` Kevin Wolf
2009-09-11 18:39       ` Christoph Hellwig
