* [Qemu-devel] [PATCH v3 0/4] virtio-blk: add multiread support
@ 2015-01-30 14:32 Peter Lieven
2015-01-30 14:32 ` [Qemu-devel] [PATCH v3 1/4] block: add accounting for merged requests Peter Lieven
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Peter Lieven @ 2015-01-30 14:32 UTC (permalink / raw)
To: qemu-devel
Cc: kwolf, famz, benoit, ming.lei, Peter Lieven, armbru, mreitz,
stefanha, pbonzini
This series adds the long-missing multiread support to virtio-blk.
Some remarks:
- I introduced rd_merged and wr_merged block accounting stats in
blockstats as a generic interface which can be updated by any
driver that introduces multirequest merging in the future (see
the sketch after these remarks).
- The knob to disable request merging is not there yet. I would
add it to the device properties, again as a generic interface,
so that any driver that introduces request merging in the future
uses the same switch. As there has been no such knob in the past,
I would post this as a separate series; it needs some changes to
parameter parsing which might lead to further discussion.
- The old multiwrite interface is still there and might be removed.
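A rough illustration of how a merging driver would use the interface from
Patch 1: after folding num_reqs requests into a single I/O, it accounts for
the merged-away ones as below (sketch only; it mirrors what Patch 4 does for
virtio-blk, the surrounding merge loop is omitted):
    if (num_reqs > 1) {
        block_acct_merge_done(blk_get_stats(blk),
                              is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ,
                              num_reqs - 1); /* count only the requests merged away */
    }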
v2->v3:
- completely reworked Patch 4 as it turned out that, depending on the
kernel, even sequential requests arrive out of order, or we receive
a mix of requests from different software queues.
v1->v2:
- add overflow checking for nb_sectors [Kevin]
- do not change the name of the macro for the maximum number of mergeable requests. [Fam]
RFC v2->v1:
- added Eric's grammar corrections to Patch 4
- Patch 4
- reworked the if tree in virtio_blk_handle_request into a switch
statement to make it easier to understand
- stop merging if requests switch from read to write or vice
versa; otherwise we could introduce big delays, e.g. a small
read request could be followed by a lot of write requests
and the read request would sit there until the queue is empty.
RFC v1->RFC v2:
- completed Patch 1 with the example in qmp-commands.hx [Eric]
- used bool for merge in Patch 4 [Max]
- fixed a few typos in the commit msg of Patch 4 [Max]
- do not start merging and directly pass req[0]->qiov in case of num_reqs == 1
- avoid allocating memory for the multireq [Kevin]
- do not include block_int.h and add an appropriate interface to block-backend [Kevin]
- removed debug output and added trace event for multireq [Kevin]
- fixed alloc hint for the merge qiov [Kevin]
- did not split virtio_submit_multireq into separate read/write functions since
the redundant code would now be a much bigger part than in the original patch.
- added a merge_qiov to VirtIOBlockReq. Abusing the existing qiov was not possible
because it is initialized externally with guest memory [Kevin]
- added a pointer to VirtIOBlockReq to create a linked list
of VirtIOBlockReqs. This list is used to complete all
requests belonging to a multireq [Kevin]
Peter Lieven (4):
block: add accounting for merged requests
hw/virtio-blk: add a constant for max number of merged requests
block-backend: expose bs->bl.max_transfer_length
virtio-blk: introduce multiread
block.c | 2 +
block/accounting.c | 7 +
block/block-backend.c | 5 +
block/qapi.c | 2 +
hmp.c | 6 +-
hw/block/dataplane/virtio-blk.c | 8 +-
hw/block/virtio-blk.c | 288 +++++++++++++++++++++++++++-------------
include/block/accounting.h | 3 +
include/hw/virtio/virtio-blk.h | 17 ++-
include/sysemu/block-backend.h | 1 +
qapi/block-core.json | 9 +-
qmp-commands.hx | 22 ++-
trace-events | 1 +
13 files changed, 261 insertions(+), 110 deletions(-)
--
1.9.1
* [Qemu-devel] [PATCH v3 1/4] block: add accounting for merged requests
2015-01-30 14:32 [Qemu-devel] [PATCH v3 0/4] virtio-blk: add multiread support Peter Lieven
@ 2015-01-30 14:32 ` Peter Lieven
2015-01-30 15:53 ` Max Reitz
2015-01-30 14:33 ` [Qemu-devel] [PATCH v3 2/4] hw/virtio-blk: add a constant for max number of " Peter Lieven
` (2 subsequent siblings)
3 siblings, 1 reply; 14+ messages in thread
From: Peter Lieven @ 2015-01-30 14:32 UTC (permalink / raw)
To: qemu-devel
Cc: kwolf, famz, benoit, ming.lei, Peter Lieven, armbru, mreitz,
stefanha, pbonzini
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
block.c | 2 ++
block/accounting.c | 7 +++++++
block/qapi.c | 2 ++
hmp.c | 6 +++++-
include/block/accounting.h | 3 +++
qapi/block-core.json | 9 ++++++++-
qmp-commands.hx | 22 ++++++++++++++++++----
7 files changed, 45 insertions(+), 6 deletions(-)
diff --git a/block.c b/block.c
index d45e4dd..61412e9 100644
--- a/block.c
+++ b/block.c
@@ -4562,6 +4562,8 @@ static int multiwrite_merge(BlockDriverState *bs, BlockRequest *reqs,
}
}
+ block_acct_merge_done(&bs->stats, BLOCK_ACCT_WRITE, num_reqs - outidx - 1);
+
return outidx + 1;
}
diff --git a/block/accounting.c b/block/accounting.c
index 18102f0..01d594f 100644
--- a/block/accounting.c
+++ b/block/accounting.c
@@ -54,3 +54,10 @@ void block_acct_highest_sector(BlockAcctStats *stats, int64_t sector_num,
stats->wr_highest_sector = sector_num + nb_sectors - 1;
}
}
+
+void block_acct_merge_done(BlockAcctStats *stats, enum BlockAcctType type,
+ int num_requests)
+{
+ assert(type < BLOCK_MAX_IOTYPE);
+ stats->merged[type] += num_requests;
+}
diff --git a/block/qapi.c b/block/qapi.c
index 75c388e..d1a8917 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -335,6 +335,8 @@ static BlockStats *bdrv_query_stats(const BlockDriverState *bs,
s->stats->wr_bytes = bs->stats.nr_bytes[BLOCK_ACCT_WRITE];
s->stats->rd_operations = bs->stats.nr_ops[BLOCK_ACCT_READ];
s->stats->wr_operations = bs->stats.nr_ops[BLOCK_ACCT_WRITE];
+ s->stats->rd_merged = bs->stats.merged[BLOCK_ACCT_READ];
+ s->stats->wr_merged = bs->stats.merged[BLOCK_ACCT_WRITE];
s->stats->wr_highest_offset =
bs->stats.wr_highest_sector * BDRV_SECTOR_SIZE;
s->stats->flush_operations = bs->stats.nr_ops[BLOCK_ACCT_FLUSH];
diff --git a/hmp.c b/hmp.c
index 481be80..5a10154 100644
--- a/hmp.c
+++ b/hmp.c
@@ -474,6 +474,8 @@ void hmp_info_blockstats(Monitor *mon, const QDict *qdict)
" wr_total_time_ns=%" PRId64
" rd_total_time_ns=%" PRId64
" flush_total_time_ns=%" PRId64
+ " rd_merged=%" PRId64
+ " wr_merged=%" PRId64
"\n",
stats->value->stats->rd_bytes,
stats->value->stats->wr_bytes,
@@ -482,7 +484,9 @@ void hmp_info_blockstats(Monitor *mon, const QDict *qdict)
stats->value->stats->flush_operations,
stats->value->stats->wr_total_time_ns,
stats->value->stats->rd_total_time_ns,
- stats->value->stats->flush_total_time_ns);
+ stats->value->stats->flush_total_time_ns,
+ stats->value->stats->rd_merged,
+ stats->value->stats->wr_merged);
}
qapi_free_BlockStatsList(stats_list);
diff --git a/include/block/accounting.h b/include/block/accounting.h
index 50b42b3..4c406cf 100644
--- a/include/block/accounting.h
+++ b/include/block/accounting.h
@@ -39,6 +39,7 @@ typedef struct BlockAcctStats {
uint64_t nr_bytes[BLOCK_MAX_IOTYPE];
uint64_t nr_ops[BLOCK_MAX_IOTYPE];
uint64_t total_time_ns[BLOCK_MAX_IOTYPE];
+ uint64_t merged[BLOCK_MAX_IOTYPE];
uint64_t wr_highest_sector;
} BlockAcctStats;
@@ -53,5 +54,7 @@ void block_acct_start(BlockAcctStats *stats, BlockAcctCookie *cookie,
void block_acct_done(BlockAcctStats *stats, BlockAcctCookie *cookie);
void block_acct_highest_sector(BlockAcctStats *stats, int64_t sector_num,
unsigned int nb_sectors);
+void block_acct_merge_done(BlockAcctStats *stats, enum BlockAcctType type,
+ int num_requests);
#endif
diff --git a/qapi/block-core.json b/qapi/block-core.json
index 80984d1..b7d9772 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -407,13 +407,20 @@
# growable sparse files (like qcow2) that are used on top
# of a physical device.
#
+# @rd_merged: Number of read requests that have been merged into another
+# request (Since 2.3).
+#
+# @wr_merged: Number of write requests that have been merged into another
+# request (Since 2.3).
+#
# Since: 0.14.0
##
{ 'type': 'BlockDeviceStats',
'data': {'rd_bytes': 'int', 'wr_bytes': 'int', 'rd_operations': 'int',
'wr_operations': 'int', 'flush_operations': 'int',
'flush_total_time_ns': 'int', 'wr_total_time_ns': 'int',
- 'rd_total_time_ns': 'int', 'wr_highest_offset': 'int' } }
+ 'rd_total_time_ns': 'int', 'wr_highest_offset': 'int',
+ 'rd_merged': 'int', 'wr_merged': 'int' } }
##
# @BlockStats:
diff --git a/qmp-commands.hx b/qmp-commands.hx
index c5f16dd..af3fd19 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -2303,6 +2303,10 @@ Each json-object contain the following:
- "flush_total_time_ns": total time spend on cache flushes in nano-seconds (json-int)
- "wr_highest_offset": Highest offset of a sector written since the
BlockDriverState has been opened (json-int)
+ - "rd_merged": number of read requests that have been merged into
+ another request (json-int)
+ - "wr_merged": number of write requests that have been merged into
+ another request (json-int)
- "parent": Contains recursively the statistics of the underlying
protocol (e.g. the host file for a qcow2 image). If there is
no underlying protocol, this field is omitted
@@ -2326,6 +2330,8 @@ Example:
"rd_total_times_ns":3465673657
"flush_total_times_ns":49653
"flush_operations":61,
+ "rd_merged":0,
+ "wr_merged":0
}
},
"stats":{
@@ -2337,7 +2343,9 @@ Example:
"flush_operations":51,
"wr_total_times_ns":313253456
"rd_total_times_ns":3465673657
- "flush_total_times_ns":49653
+ "flush_total_times_ns":49653,
+ "rd_merged":0,
+ "wr_merged":0
}
},
{
@@ -2351,7 +2359,9 @@ Example:
"flush_operations":0,
"wr_total_times_ns":0
"rd_total_times_ns":0
- "flush_total_times_ns":0
+ "flush_total_times_ns":0,
+ "rd_merged":0,
+ "wr_merged":0
}
},
{
@@ -2365,7 +2375,9 @@ Example:
"flush_operations":0,
"wr_total_times_ns":0
"rd_total_times_ns":0
- "flush_total_times_ns":0
+ "flush_total_times_ns":0,
+ "rd_merged":0,
+ "wr_merged":0
}
},
{
@@ -2379,7 +2391,9 @@ Example:
"flush_operations":0,
"wr_total_times_ns":0
"rd_total_times_ns":0
- "flush_total_times_ns":0
+ "flush_total_times_ns":0,
+ "rd_merged":0,
+ "wr_merged":0
}
}
]
--
1.9.1
* [Qemu-devel] [PATCH v3 2/4] hw/virtio-blk: add a constant for max number of merged requests
2015-01-30 14:32 [Qemu-devel] [PATCH v3 0/4] virtio-blk: add multiread support Peter Lieven
2015-01-30 14:32 ` [Qemu-devel] [PATCH v3 1/4] block: add accounting for merged requests Peter Lieven
@ 2015-01-30 14:33 ` Peter Lieven
2015-01-30 16:01 ` Max Reitz
2015-01-30 14:33 ` [Qemu-devel] [PATCH v3 3/4] block-backend: expose bs->bl.max_transfer_length Peter Lieven
2015-01-30 14:33 ` [Qemu-devel] [PATCH v3 4/4] virtio-blk: introduce multiread Peter Lieven
3 siblings, 1 reply; 14+ messages in thread
From: Peter Lieven @ 2015-01-30 14:33 UTC (permalink / raw)
To: qemu-devel
Cc: kwolf, famz, benoit, ming.lei, Peter Lieven, armbru, mreitz,
stefanha, pbonzini
As it was not obvious (at least to me) where the 32 comes from,
add a constant for it.
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
hw/block/virtio-blk.c | 2 +-
include/hw/virtio/virtio-blk.h | 4 +++-
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index 4032fca..e04adb8 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -360,7 +360,7 @@ static void virtio_blk_handle_write(VirtIOBlockReq *req, MultiReqBuffer *mrb)
block_acct_start(blk_get_stats(req->dev->blk), &req->acct, req->qiov.size,
BLOCK_ACCT_WRITE);
- if (mrb->num_writes == 32) {
+ if (mrb->num_writes == VIRTIO_BLK_MAX_MERGE_REQS) {
virtio_submit_multiwrite(req->dev->blk, mrb);
}
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 4652b70..6ef3fa5 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -134,8 +134,10 @@ typedef struct VirtIOBlock {
struct VirtIOBlockDataPlane *dataplane;
} VirtIOBlock;
+#define VIRTIO_BLK_MAX_MERGE_REQS 32
+
typedef struct MultiReqBuffer {
- BlockRequest blkreq[32];
+ BlockRequest blkreq[VIRTIO_BLK_MAX_MERGE_REQS];
unsigned int num_writes;
} MultiReqBuffer;
--
1.9.1
* [Qemu-devel] [PATCH v3 3/4] block-backend: expose bs->bl.max_transfer_length
2015-01-30 14:32 [Qemu-devel] [PATCH v3 0/4] virtio-blk: add multiread support Peter Lieven
2015-01-30 14:32 ` [Qemu-devel] [PATCH v3 1/4] block: add accounting for merged requests Peter Lieven
2015-01-30 14:33 ` [Qemu-devel] [PATCH v3 2/4] hw/virtio-blk: add a constant for max number of " Peter Lieven
@ 2015-01-30 14:33 ` Peter Lieven
2015-01-30 16:04 ` Max Reitz
2015-01-30 14:33 ` [Qemu-devel] [PATCH v3 4/4] virtio-blk: introduce multiread Peter Lieven
3 siblings, 1 reply; 14+ messages in thread
From: Peter Lieven @ 2015-01-30 14:33 UTC (permalink / raw)
To: qemu-devel
Cc: kwolf, famz, benoit, ming.lei, Peter Lieven, armbru, mreitz,
stefanha, pbonzini
Signed-off-by: Peter Lieven <pl@kamp.de>
---
block/block-backend.c | 5 +++++
include/sysemu/block-backend.h | 1 +
2 files changed, 6 insertions(+)
diff --git a/block/block-backend.c b/block/block-backend.c
index d00c129..c28e240 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -580,6 +580,11 @@ int blk_get_flags(BlockBackend *blk)
return bdrv_get_flags(blk->bs);
}
+int blk_get_max_transfer_length(BlockBackend *blk)
+{
+ return blk->bs->bl.max_transfer_length;
+}
+
void blk_set_guest_block_size(BlockBackend *blk, int align)
{
bdrv_set_guest_block_size(blk->bs, align);
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index 8871a02..aab12b9 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -127,6 +127,7 @@ int blk_is_inserted(BlockBackend *blk);
void blk_lock_medium(BlockBackend *blk, bool locked);
void blk_eject(BlockBackend *blk, bool eject_flag);
int blk_get_flags(BlockBackend *blk);
+int blk_get_max_transfer_length(BlockBackend *blk);
void blk_set_guest_block_size(BlockBackend *blk, int align);
void *blk_blockalign(BlockBackend *blk, size_t size);
bool blk_op_is_blocked(BlockBackend *blk, BlockOpType op, Error **errp);
--
1.9.1
* [Qemu-devel] [PATCH v3 4/4] virtio-blk: introduce multiread
2015-01-30 14:32 [Qemu-devel] [PATCH v3 0/4] virtio-blk: add multiread support Peter Lieven
` (2 preceding siblings ...)
2015-01-30 14:33 ` [Qemu-devel] [PATCH v3 3/4] block-backend: expose bs->bl.max_transfer_length Peter Lieven
@ 2015-01-30 14:33 ` Peter Lieven
2015-01-30 17:16 ` Max Reitz
3 siblings, 1 reply; 14+ messages in thread
From: Peter Lieven @ 2015-01-30 14:33 UTC (permalink / raw)
To: qemu-devel
Cc: kwolf, famz, benoit, ming.lei, Peter Lieven, armbru, mreitz,
stefanha, pbonzini
This patch finally introduces multiread support to virtio-blk. While
multiwrite support has been there for a long time, read support was missing.
The complete merge logic is moved into virtio-blk.c, which has
been the only user of request merging anyway. This is required
to be able to merge chunks of requests and immediately invoke callbacks
for those requests. Secondly, it is required for the planned switch to
direct invocation of coroutines at a later stage.
The following benchmarks show the performance of running fio with
4 worker threads on a local ram disk. The numbers are the average
of 10 test runs after 1 warmup run.
              |        4k        |        64k       |        4k
MB/s          | rd seq | rd rand | rd seq | rd rand | wr seq | wr rand
--------------+--------+---------+--------+---------+--------+--------
master        |   1221 |    1187 |   4178 |    4114 |   1745 |    1213
multiread     |   1829 |    1189 |   4639 |    4110 |   1894 |    1216
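For readers skimming the diff: after sorting the queued requests by
sector_num, the new virtio_submit_multireq() keeps adding a request to the
current merge only while all of the conditions below hold (a condensed and
slightly simplified sketch of the checks implemented in the patch; the cap of
VIRTIO_BLK_MAX_MERGE_REQS requests and the read/write direction switch are
already enforced in virtio_blk_handle_request):
    /* condensed sketch of the merge criteria, not a drop-in function */
    bool can_merge = niov + req->qiov.niov <= IOV_MAX                         /* iovec limit */
        && nb_sectors + req->qiov.size / BDRV_SECTOR_SIZE <= max_xfer_len     /* backend limit */
        && sector_num + nb_sectors == req->sector_num;                        /* strictly sequential */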
Signed-off-by: Peter Lieven <pl@kamp.de>
---
hw/block/dataplane/virtio-blk.c | 8 +-
hw/block/virtio-blk.c | 288 +++++++++++++++++++++++++++-------------
include/hw/virtio/virtio-blk.h | 19 +--
trace-events | 1 +
4 files changed, 210 insertions(+), 106 deletions(-)
diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 39c5d71..93472e9 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -96,9 +96,7 @@ static void handle_notify(EventNotifier *e)
event_notifier_test_and_clear(&s->host_notifier);
blk_io_plug(s->conf->conf.blk);
for (;;) {
- MultiReqBuffer mrb = {
- .num_writes = 0,
- };
+ MultiReqBuffer mrb = {};
int ret;
/* Disable guest->host notifies to avoid unnecessary vmexits */
@@ -120,7 +118,9 @@ static void handle_notify(EventNotifier *e)
virtio_blk_handle_request(req, &mrb);
}
- virtio_submit_multiwrite(s->conf->conf.blk, &mrb);
+ if (mrb.num_reqs) {
+ virtio_submit_multireq(s->conf->conf.blk, &mrb);
+ }
if (likely(ret == -EAGAIN)) { /* vring emptied */
/* Re-enable guest->host notifies and stop processing the vring.
diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
index e04adb8..77890a0 100644
--- a/hw/block/virtio-blk.c
+++ b/hw/block/virtio-blk.c
@@ -34,6 +34,7 @@ VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
req->dev = s;
req->qiov.size = 0;
req->next = NULL;
+ req->mr_next = NULL;
return req;
}
@@ -84,20 +85,29 @@ static int virtio_blk_handle_rw_error(VirtIOBlockReq *req, int error,
static void virtio_blk_rw_complete(void *opaque, int ret)
{
- VirtIOBlockReq *req = opaque;
+ VirtIOBlockReq *next = opaque;
- trace_virtio_blk_rw_complete(req, ret);
+ while (next) {
+ VirtIOBlockReq *req = next;
+ next = req->mr_next;
+ trace_virtio_blk_rw_complete(req, ret);
- if (ret) {
- int p = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
- bool is_read = !(p & VIRTIO_BLK_T_OUT);
- if (virtio_blk_handle_rw_error(req, -ret, is_read))
- return;
- }
+ if (req->qiov.nalloc != -1) {
+ qemu_iovec_destroy(&req->qiov);
+ }
- virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
- block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
- virtio_blk_free_request(req);
+ if (ret) {
+ int p = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
+ bool is_read = !(p & VIRTIO_BLK_T_OUT);
+ if (virtio_blk_handle_rw_error(req, -ret, is_read)) {
+ continue;
+ }
+ }
+
+ virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
+ block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
+ virtio_blk_free_request(req);
+ }
}
static void virtio_blk_flush_complete(void *opaque, int ret)
@@ -291,24 +301,125 @@ static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
}
}
-void virtio_submit_multiwrite(BlockBackend *blk, MultiReqBuffer *mrb)
+static inline void
+virtio_submit_multireq2(BlockBackend *blk, MultiReqBuffer *mrb,
+ int start, int num_reqs, int niov)
{
- int i, ret;
+ QEMUIOVector *qiov = &mrb->reqs[start]->qiov;
+ int64_t sector_num = mrb->reqs[start]->sector_num;
+ int nb_sectors = mrb->reqs[start]->qiov.size / BDRV_SECTOR_SIZE;
+ bool is_write = mrb->is_write;
+
+ if (num_reqs > 1) {
+ int i;
+ struct iovec *_iov = qiov->iov;
+ int _niov = qiov->niov;
+
+ qemu_iovec_init(qiov, niov);
+
+ for (i = 0; i < _niov; i++) {
+ qemu_iovec_add(qiov, _iov[i].iov_base, _iov[i].iov_len);
+ }
- if (!mrb->num_writes) {
+ for (i = start + 1; i < start + num_reqs; i++) {
+ qemu_iovec_concat(qiov, &mrb->reqs[i]->qiov, 0,
+ mrb->reqs[i]->qiov.size);
+ mrb->reqs[i - 1]->mr_next = mrb->reqs[i];
+ nb_sectors += mrb->reqs[i]->qiov.size / BDRV_SECTOR_SIZE;
+ }
+ assert(nb_sectors == qiov->size / BDRV_SECTOR_SIZE);
+
+ trace_virtio_blk_submit_multireq(mrb, start, num_reqs, sector_num,
+ nb_sectors, is_write);
+ block_acct_merge_done(blk_get_stats(blk),
+ is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ,
+ num_reqs - 1);
+ }
+
+ if (is_write) {
+ blk_aio_writev(blk, sector_num, qiov, nb_sectors,
+ virtio_blk_rw_complete, mrb->reqs[start]);
+ } else {
+ blk_aio_readv(blk, sector_num, qiov, nb_sectors,
+ virtio_blk_rw_complete, mrb->reqs[start]);
+ }
+}
+
+static int virtio_multireq_compare(const void *a, const void *b)
+{
+ const VirtIOBlockReq *req1 = *(VirtIOBlockReq **)a,
+ *req2 = *(VirtIOBlockReq **)b;
+
+ /*
+ * Note that we can't simply subtract sector_num1 from sector_num2
+ * here as that could overflow the return value.
+ */
+ if (req1->sector_num > req2->sector_num) {
+ return 1;
+ } else if (req1->sector_num < req2->sector_num) {
+ return -1;
+ } else {
+ return 0;
+ }
+}
+
+void virtio_submit_multireq(BlockBackend *blk, MultiReqBuffer *mrb)
+{
+ int i = 0, start = 0, num_reqs = 0, niov = 0, nb_sectors = 0,
+ max_xfer_len = 0;
+ int64_t sector_num = 0;
+
+ if (mrb->num_reqs == 1) {
+ virtio_submit_multireq2(blk, mrb, 0, 1, -1);
+ mrb->num_reqs = 0;
return;
}
- ret = blk_aio_multiwrite(blk, mrb->blkreq, mrb->num_writes);
- if (ret != 0) {
- for (i = 0; i < mrb->num_writes; i++) {
- if (mrb->blkreq[i].error) {
- virtio_blk_rw_complete(mrb->blkreq[i].opaque, -EIO);
+ max_xfer_len = blk_get_max_transfer_length(mrb->reqs[0]->dev->blk);
+ max_xfer_len = MIN_NON_ZERO(max_xfer_len, INT_MAX);
+
+ qsort(mrb->reqs, mrb->num_reqs, sizeof(*mrb->reqs),
+ &virtio_multireq_compare);
+
+ for (i = 0; i < mrb->num_reqs; i++) {
+ VirtIOBlockReq *req = mrb->reqs[i];
+ if (num_reqs > 0) {
+ bool merge = true;
+
+ /* merge would exceed maximum number of IOVs */
+ if (niov + req->qiov.niov + 1 > IOV_MAX) {
+ merge = false;
+ }
+
+ /* merge would exceed maximum transfer length of backend device */
+ if (req->qiov.size / BDRV_SECTOR_SIZE + nb_sectors > max_xfer_len) {
+ merge = false;
+ }
+
+ /* requests are not sequential */
+ if (sector_num + nb_sectors != req->sector_num) {
+ merge = false;
+ }
+
+ if (!merge) {
+ virtio_submit_multireq2(blk, mrb, start, num_reqs, niov);
+ num_reqs = 0;
}
}
+
+ if (num_reqs == 0) {
+ sector_num = req->sector_num;
+ nb_sectors = niov = 0;
+ start = i;
+ }
+
+ nb_sectors += req->qiov.size / BDRV_SECTOR_SIZE;
+ niov += req->qiov.niov;
+ num_reqs++;
}
- mrb->num_writes = 0;
+ virtio_submit_multireq2(blk, mrb, start, num_reqs, niov);
+ mrb->num_reqs = 0;
}
static void virtio_blk_handle_flush(VirtIOBlockReq *req, MultiReqBuffer *mrb)
@@ -319,7 +430,9 @@ static void virtio_blk_handle_flush(VirtIOBlockReq *req, MultiReqBuffer *mrb)
/*
* Make sure all outstanding writes are posted to the backing device.
*/
- virtio_submit_multiwrite(req->dev->blk, mrb);
+ if (mrb->is_write && mrb->num_reqs > 0) {
+ virtio_submit_multireq(req->dev->blk, mrb);
+ }
blk_aio_flush(req->dev->blk, virtio_blk_flush_complete, req);
}
@@ -329,6 +442,9 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev,
uint64_t nb_sectors = size >> BDRV_SECTOR_BITS;
uint64_t total_sectors;
+ if (nb_sectors > INT_MAX) {
+ return false;
+ }
if (sector & dev->sector_mask) {
return false;
}
@@ -342,60 +458,6 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev,
return true;
}
-static void virtio_blk_handle_write(VirtIOBlockReq *req, MultiReqBuffer *mrb)
-{
- BlockRequest *blkreq;
- uint64_t sector;
-
- sector = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector);
-
- trace_virtio_blk_handle_write(req, sector, req->qiov.size / 512);
-
- if (!virtio_blk_sect_range_ok(req->dev, sector, req->qiov.size)) {
- virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
- virtio_blk_free_request(req);
- return;
- }
-
- block_acct_start(blk_get_stats(req->dev->blk), &req->acct, req->qiov.size,
- BLOCK_ACCT_WRITE);
-
- if (mrb->num_writes == VIRTIO_BLK_MAX_MERGE_REQS) {
- virtio_submit_multiwrite(req->dev->blk, mrb);
- }
-
- blkreq = &mrb->blkreq[mrb->num_writes];
- blkreq->sector = sector;
- blkreq->nb_sectors = req->qiov.size / BDRV_SECTOR_SIZE;
- blkreq->qiov = &req->qiov;
- blkreq->cb = virtio_blk_rw_complete;
- blkreq->opaque = req;
- blkreq->error = 0;
-
- mrb->num_writes++;
-}
-
-static void virtio_blk_handle_read(VirtIOBlockReq *req)
-{
- uint64_t sector;
-
- sector = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector);
-
- trace_virtio_blk_handle_read(req, sector, req->qiov.size / 512);
-
- if (!virtio_blk_sect_range_ok(req->dev, sector, req->qiov.size)) {
- virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
- virtio_blk_free_request(req);
- return;
- }
-
- block_acct_start(blk_get_stats(req->dev->blk), &req->acct, req->qiov.size,
- BLOCK_ACCT_READ);
- blk_aio_readv(req->dev->blk, sector, &req->qiov,
- req->qiov.size / BDRV_SECTOR_SIZE,
- virtio_blk_rw_complete, req);
-}
-
void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
{
uint32_t type;
@@ -430,11 +492,54 @@ void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
type = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
- if (type & VIRTIO_BLK_T_FLUSH) {
+ switch (type & 0xff) {
+ case VIRTIO_BLK_T_IN:
+ case VIRTIO_BLK_T_OUT:
+ {
+ bool is_write = type & VIRTIO_BLK_T_OUT;
+ req->sector_num = virtio_ldq_p(VIRTIO_DEVICE(req->dev),
+ &req->out.sector);
+
+ if (!virtio_blk_sect_range_ok(req->dev, req->sector_num,
+ req->qiov.size)) {
+ virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
+ virtio_blk_free_request(req);
+ return;
+ }
+
+ if (is_write) {
+ qemu_iovec_init_external(&req->qiov, iov, out_num);
+ trace_virtio_blk_handle_write(req, req->sector_num,
+ req->qiov.size / BDRV_SECTOR_SIZE);
+ } else {
+ qemu_iovec_init_external(&req->qiov, in_iov, in_num);
+ trace_virtio_blk_handle_read(req, req->sector_num,
+ req->qiov.size / BDRV_SECTOR_SIZE);
+ }
+
+ block_acct_start(blk_get_stats(req->dev->blk),
+ &req->acct, req->qiov.size,
+ is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
+
+ /* merge would exceed maximum number of requests or IO direction
+ * changes */
+ if (mrb->num_reqs > 0 && (mrb->num_reqs == VIRTIO_BLK_MAX_MERGE_REQS ||
+ is_write != mrb->is_write)) {
+ virtio_submit_multireq(req->dev->blk, mrb);
+ }
+
+ mrb->reqs[mrb->num_reqs++] = req;
+ mrb->is_write = is_write;
+ break;
+ }
+ case VIRTIO_BLK_T_FLUSH:
virtio_blk_handle_flush(req, mrb);
- } else if (type & VIRTIO_BLK_T_SCSI_CMD) {
+ break;
+ case VIRTIO_BLK_T_SCSI_CMD:
virtio_blk_handle_scsi(req);
- } else if (type & VIRTIO_BLK_T_GET_ID) {
+ break;
+ case VIRTIO_BLK_T_GET_ID:
+ {
VirtIOBlock *s = req->dev;
/*
@@ -448,14 +553,9 @@ void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
iov_from_buf(in_iov, in_num, 0, serial, size);
virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
virtio_blk_free_request(req);
- } else if (type & VIRTIO_BLK_T_OUT) {
- qemu_iovec_init_external(&req->qiov, iov, out_num);
- virtio_blk_handle_write(req, mrb);
- } else if (type == VIRTIO_BLK_T_IN || type == VIRTIO_BLK_T_BARRIER) {
- /* VIRTIO_BLK_T_IN is 0, so we can't just & it. */
- qemu_iovec_init_external(&req->qiov, in_iov, in_num);
- virtio_blk_handle_read(req);
- } else {
+ break;
+ }
+ default:
virtio_blk_req_complete(req, VIRTIO_BLK_S_UNSUPP);
virtio_blk_free_request(req);
}
@@ -465,9 +565,7 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
{
VirtIOBlock *s = VIRTIO_BLK(vdev);
VirtIOBlockReq *req;
- MultiReqBuffer mrb = {
- .num_writes = 0,
- };
+ MultiReqBuffer mrb = {};
/* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
* dataplane here instead of waiting for .set_status().
@@ -481,7 +579,9 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
virtio_blk_handle_request(req, &mrb);
}
- virtio_submit_multiwrite(s->blk, &mrb);
+ if (mrb.num_reqs) {
+ virtio_submit_multireq(s->blk, &mrb);
+ }
/*
* FIXME: Want to check for completions before returning to guest mode,
@@ -494,9 +594,7 @@ static void virtio_blk_dma_restart_bh(void *opaque)
{
VirtIOBlock *s = opaque;
VirtIOBlockReq *req = s->rq;
- MultiReqBuffer mrb = {
- .num_writes = 0,
- };
+ MultiReqBuffer mrb = {};
qemu_bh_delete(s->bh);
s->bh = NULL;
@@ -509,7 +607,9 @@ static void virtio_blk_dma_restart_bh(void *opaque)
req = next;
}
- virtio_submit_multiwrite(s->blk, &mrb);
+ if (mrb.num_reqs) {
+ virtio_submit_multireq(s->blk, &mrb);
+ }
}
static void virtio_blk_dma_restart_cb(void *opaque, int running,
diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
index 6ef3fa5..2feeb57 100644
--- a/include/hw/virtio/virtio-blk.h
+++ b/include/hw/virtio/virtio-blk.h
@@ -134,29 +134,32 @@ typedef struct VirtIOBlock {
struct VirtIOBlockDataPlane *dataplane;
} VirtIOBlock;
-#define VIRTIO_BLK_MAX_MERGE_REQS 32
-
-typedef struct MultiReqBuffer {
- BlockRequest blkreq[VIRTIO_BLK_MAX_MERGE_REQS];
- unsigned int num_writes;
-} MultiReqBuffer;
-
typedef struct VirtIOBlockReq {
+ int64_t sector_num;
VirtIOBlock *dev;
VirtQueueElement elem;
struct virtio_blk_inhdr *in;
struct virtio_blk_outhdr out;
QEMUIOVector qiov;
struct VirtIOBlockReq *next;
+ struct VirtIOBlockReq *mr_next;
BlockAcctCookie acct;
} VirtIOBlockReq;
+#define VIRTIO_BLK_MAX_MERGE_REQS 32
+
+typedef struct MultiReqBuffer {
+ VirtIOBlockReq *reqs[VIRTIO_BLK_MAX_MERGE_REQS];
+ unsigned int num_reqs;
+ bool is_write;
+} MultiReqBuffer;
+
VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s);
void virtio_blk_free_request(VirtIOBlockReq *req);
void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb);
-void virtio_submit_multiwrite(BlockBackend *blk, MultiReqBuffer *mrb);
+void virtio_submit_multireq(BlockBackend *blk, MultiReqBuffer *mrb);
#endif
diff --git a/trace-events b/trace-events
index 4ec81eb..9ef3b68 100644
--- a/trace-events
+++ b/trace-events
@@ -116,6 +116,7 @@ virtio_blk_req_complete(void *req, int status) "req %p status %d"
virtio_blk_rw_complete(void *req, int ret) "req %p ret %d"
virtio_blk_handle_write(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
virtio_blk_handle_read(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
+virtio_blk_submit_multireq(void *mrb, int start, int num_reqs, uint64_t sector, size_t nsectors, bool is_write) "mrb %p start %d num_reqs %d sector %"PRIu64" nsectors %zu is_write %d"
# hw/block/dataplane/virtio-blk.c
virtio_blk_data_plane_start(void *s) "dataplane %p"
--
1.9.1
* Re: [Qemu-devel] [PATCH v3 1/4] block: add accounting for merged requests
2015-01-30 14:32 ` [Qemu-devel] [PATCH v3 1/4] block: add accounting for merged requests Peter Lieven
@ 2015-01-30 15:53 ` Max Reitz
0 siblings, 0 replies; 14+ messages in thread
From: Max Reitz @ 2015-01-30 15:53 UTC (permalink / raw)
To: Peter Lieven, qemu-devel
Cc: kwolf, famz, benoit, ming.lei, armbru, stefanha, pbonzini
On 2015-01-30 at 09:32, Peter Lieven wrote:
> Signed-off-by: Peter Lieven <pl@kamp.de>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> ---
> block.c | 2 ++
> block/accounting.c | 7 +++++++
> block/qapi.c | 2 ++
> hmp.c | 6 +++++-
> include/block/accounting.h | 3 +++
> qapi/block-core.json | 9 ++++++++-
> qmp-commands.hx | 22 ++++++++++++++++++----
> 7 files changed, 45 insertions(+), 6 deletions(-)
Reviewed-by: Max Reitz <mreitz@redhat.com>
* Re: [Qemu-devel] [PATCH v3 2/4] hw/virtio-blk: add a constant for max number of merged requests
2015-01-30 14:33 ` [Qemu-devel] [PATCH v3 2/4] hw/virtio-blk: add a constant for max number of " Peter Lieven
@ 2015-01-30 16:01 ` Max Reitz
0 siblings, 0 replies; 14+ messages in thread
From: Max Reitz @ 2015-01-30 16:01 UTC (permalink / raw)
To: Peter Lieven, qemu-devel
Cc: kwolf, famz, benoit, ming.lei, armbru, stefanha, pbonzini
On 2015-01-30 at 09:33, Peter Lieven wrote:
> As it was not obvious (at least for me) where the 32 comes from;
> add a constant for it.
>
> Signed-off-by: Peter Lieven <pl@kamp.de>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> ---
> hw/block/virtio-blk.c | 2 +-
> include/hw/virtio/virtio-blk.h | 4 +++-
> 2 files changed, 4 insertions(+), 2 deletions(-)
Reviewed-by: Max Reitz <mreitz@redhat.com>
* Re: [Qemu-devel] [PATCH v3 3/4] block-backend: expose bs->bl.max_transfer_length
2015-01-30 14:33 ` [Qemu-devel] [PATCH v3 3/4] block-backend: expose bs->bl.max_transfer_length Peter Lieven
@ 2015-01-30 16:04 ` Max Reitz
0 siblings, 0 replies; 14+ messages in thread
From: Max Reitz @ 2015-01-30 16:04 UTC (permalink / raw)
To: Peter Lieven, qemu-devel
Cc: kwolf, famz, benoit, ming.lei, armbru, stefanha, pbonzini
On 2015-01-30 at 09:33, Peter Lieven wrote:
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
> block/block-backend.c | 5 +++++
> include/sysemu/block-backend.h | 1 +
> 2 files changed, 6 insertions(+)
Reviewed-by: Max Reitz <mreitz@redhat.com>
* Re: [Qemu-devel] [PATCH v3 4/4] virtio-blk: introduce multiread
2015-01-30 14:33 ` [Qemu-devel] [PATCH v3 4/4] virtio-blk: introduce multiread Peter Lieven
@ 2015-01-30 17:16 ` Max Reitz
2015-01-30 21:05 ` Peter Lieven
0 siblings, 1 reply; 14+ messages in thread
From: Max Reitz @ 2015-01-30 17:16 UTC (permalink / raw)
To: Peter Lieven, qemu-devel
Cc: kwolf, famz, benoit, ming.lei, armbru, stefanha, pbonzini
On 2015-01-30 at 09:33, Peter Lieven wrote:
> this patch finally introduces multiread support to virtio-blk. While
> multiwrite support was there for a long time, read support was missing.
>
> The complete merge logic is moved into virtio-blk.c which has
> been the only user of request merging ever since. This is required
> to be able to merge chunks of requests and immediately invoke callbacks
> for those requests. Secondly, this is required to switch to
> direct invocation of coroutines which is planned at a later stage.
>
> The following benchmarks show the performance of running fio with
> 4 worker threads on a local ram disk. The numbers show the average
> of 10 test runs after 1 run as warmup phase.
>
> | 4k | 64k | 4k
> MB/s | rd seq | rd rand | rd seq | rd rand | wr seq | wr rand
> --------------+--------+---------+--------+---------+--------+--------
> master | 1221 | 1187 | 4178 | 4114 | 1745 | 1213
> multiread | 1829 | 1189 | 4639 | 4110 | 1894 | 1216
>
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
> hw/block/dataplane/virtio-blk.c | 8 +-
> hw/block/virtio-blk.c | 288 +++++++++++++++++++++++++++-------------
> include/hw/virtio/virtio-blk.h | 19 +--
> trace-events | 1 +
> 4 files changed, 210 insertions(+), 106 deletions(-)
>
> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
> index 39c5d71..93472e9 100644
> --- a/hw/block/dataplane/virtio-blk.c
> +++ b/hw/block/dataplane/virtio-blk.c
> @@ -96,9 +96,7 @@ static void handle_notify(EventNotifier *e)
> event_notifier_test_and_clear(&s->host_notifier);
> blk_io_plug(s->conf->conf.blk);
> for (;;) {
> - MultiReqBuffer mrb = {
> - .num_writes = 0,
> - };
> + MultiReqBuffer mrb = {};
> int ret;
>
> /* Disable guest->host notifies to avoid unnecessary vmexits */
> @@ -120,7 +118,9 @@ static void handle_notify(EventNotifier *e)
> virtio_blk_handle_request(req, &mrb);
> }
>
> - virtio_submit_multiwrite(s->conf->conf.blk, &mrb);
> + if (mrb.num_reqs) {
> + virtio_submit_multireq(s->conf->conf.blk, &mrb);
> + }
>
> if (likely(ret == -EAGAIN)) { /* vring emptied */
> /* Re-enable guest->host notifies and stop processing the vring.
> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
> index e04adb8..77890a0 100644
> --- a/hw/block/virtio-blk.c
> +++ b/hw/block/virtio-blk.c
> @@ -34,6 +34,7 @@ VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
> req->dev = s;
> req->qiov.size = 0;
> req->next = NULL;
> + req->mr_next = NULL;
> return req;
> }
>
> @@ -84,20 +85,29 @@ static int virtio_blk_handle_rw_error(VirtIOBlockReq *req, int error,
>
> static void virtio_blk_rw_complete(void *opaque, int ret)
> {
> - VirtIOBlockReq *req = opaque;
> + VirtIOBlockReq *next = opaque;
>
> - trace_virtio_blk_rw_complete(req, ret);
> + while (next) {
> + VirtIOBlockReq *req = next;
> + next = req->mr_next;
> + trace_virtio_blk_rw_complete(req, ret);
>
> - if (ret) {
> - int p = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
> - bool is_read = !(p & VIRTIO_BLK_T_OUT);
> - if (virtio_blk_handle_rw_error(req, -ret, is_read))
> - return;
> - }
> + if (req->qiov.nalloc != -1) {
> + qemu_iovec_destroy(&req->qiov);
> + }
>
> - virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
> - block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
> - virtio_blk_free_request(req);
> + if (ret) {
> + int p = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
> + bool is_read = !(p & VIRTIO_BLK_T_OUT);
> + if (virtio_blk_handle_rw_error(req, -ret, is_read)) {
> + continue;
> + }
> + }
> +
> + virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
> + block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
> + virtio_blk_free_request(req);
> + }
> }
>
> static void virtio_blk_flush_complete(void *opaque, int ret)
> @@ -291,24 +301,125 @@ static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
> }
> }
>
> -void virtio_submit_multiwrite(BlockBackend *blk, MultiReqBuffer *mrb)
> +static inline void
Is the "inline" really worth it? This function is rather long...
(To my surprise, gcc actually does inline the function)
> +virtio_submit_multireq2(BlockBackend *blk, MultiReqBuffer *mrb,
Hm, that's not a very descriptive name... Maybe "submit_merged_requests"
or something like that (it's static, so you can drop the virtio_ prefix
if you want to) would be better?
> + int start, int num_reqs, int niov)
> {
> - int i, ret;
> + QEMUIOVector *qiov = &mrb->reqs[start]->qiov;
> + int64_t sector_num = mrb->reqs[start]->sector_num;
> + int nb_sectors = mrb->reqs[start]->qiov.size / BDRV_SECTOR_SIZE;
> + bool is_write = mrb->is_write;
> +
> + if (num_reqs > 1) {
> + int i;
> + struct iovec *_iov = qiov->iov;
> + int _niov = qiov->niov;
Identifiers with leading underscores are considered reserved; I'd rather
avoid using them.
> +
> + qemu_iovec_init(qiov, niov);
> +
> + for (i = 0; i < _niov; i++) {
> + qemu_iovec_add(qiov, _iov[i].iov_base, _iov[i].iov_len);
> + }
>
> - if (!mrb->num_writes) {
> + for (i = start + 1; i < start + num_reqs; i++) {
> + qemu_iovec_concat(qiov, &mrb->reqs[i]->qiov, 0,
> + mrb->reqs[i]->qiov.size);
> + mrb->reqs[i - 1]->mr_next = mrb->reqs[i];
> + nb_sectors += mrb->reqs[i]->qiov.size / BDRV_SECTOR_SIZE;
> + }
> + assert(nb_sectors == qiov->size / BDRV_SECTOR_SIZE);
> +
> + trace_virtio_blk_submit_multireq(mrb, start, num_reqs, sector_num,
> + nb_sectors, is_write);
> + block_acct_merge_done(blk_get_stats(blk),
> + is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ,
> + num_reqs - 1);
> + }
> +
> + if (is_write) {
> + blk_aio_writev(blk, sector_num, qiov, nb_sectors,
> + virtio_blk_rw_complete, mrb->reqs[start]);
> + } else {
> + blk_aio_readv(blk, sector_num, qiov, nb_sectors,
> + virtio_blk_rw_complete, mrb->reqs[start]);
> + }
> +}
> +
> +static int virtio_multireq_compare(const void *a, const void *b)
> +{
> + const VirtIOBlockReq *req1 = *(VirtIOBlockReq **)a,
> + *req2 = *(VirtIOBlockReq **)b;
> +
> + /*
> + * Note that we can't simply subtract sector_num1 from sector_num2
> + * here as that could overflow the return value.
> + */
> + if (req1->sector_num > req2->sector_num) {
> + return 1;
> + } else if (req1->sector_num < req2->sector_num) {
> + return -1;
> + } else {
> + return 0;
> + }
> +}
> +
> +void virtio_submit_multireq(BlockBackend *blk, MultiReqBuffer *mrb)
> +{
> + int i = 0, start = 0, num_reqs = 0, niov = 0, nb_sectors = 0,
> + max_xfer_len = 0;
Albeit not necessary, starting a new "int" line won't take any more
space (and would look nicer to me, probably).
> + int64_t sector_num = 0;
> +
> + if (mrb->num_reqs == 1) {
> + virtio_submit_multireq2(blk, mrb, 0, 1, -1);
> + mrb->num_reqs = 0;
> return;
> }
>
> - ret = blk_aio_multiwrite(blk, mrb->blkreq, mrb->num_writes);
> - if (ret != 0) {
> - for (i = 0; i < mrb->num_writes; i++) {
> - if (mrb->blkreq[i].error) {
> - virtio_blk_rw_complete(mrb->blkreq[i].opaque, -EIO);
> + max_xfer_len = blk_get_max_transfer_length(mrb->reqs[0]->dev->blk);
> + max_xfer_len = MIN_NON_ZERO(max_xfer_len, INT_MAX);
> +
> + qsort(mrb->reqs, mrb->num_reqs, sizeof(*mrb->reqs),
> + &virtio_multireq_compare);
> +
> + for (i = 0; i < mrb->num_reqs; i++) {
> + VirtIOBlockReq *req = mrb->reqs[i];
> + if (num_reqs > 0) {
> + bool merge = true;
> +
> + /* merge would exceed maximum number of IOVs */
> + if (niov + req->qiov.niov + 1 > IOV_MAX) {
Hm, why the +1?
> + merge = false;
> + }
> +
> + /* merge would exceed maximum transfer length of backend device */
> + if (req->qiov.size / BDRV_SECTOR_SIZE + nb_sectors > max_xfer_len) {
> + merge = false;
> + }
> +
> + /* requests are not sequential */
> + if (sector_num + nb_sectors != req->sector_num) {
> + merge = false;
> + }
> +
> + if (!merge) {
> + virtio_submit_multireq2(blk, mrb, start, num_reqs, niov);
> + num_reqs = 0;
> }
> }
> +
> + if (num_reqs == 0) {
> + sector_num = req->sector_num;
> + nb_sectors = niov = 0;
> + start = i;
> + }
> +
> + nb_sectors += req->qiov.size / BDRV_SECTOR_SIZE;
> + niov += req->qiov.niov;
> + num_reqs++;
> }
>
> - mrb->num_writes = 0;
> + virtio_submit_multireq2(blk, mrb, start, num_reqs, niov);
> + mrb->num_reqs = 0;
> }
>
> static void virtio_blk_handle_flush(VirtIOBlockReq *req, MultiReqBuffer *mrb)
> @@ -319,7 +430,9 @@ static void virtio_blk_handle_flush(VirtIOBlockReq *req, MultiReqBuffer *mrb)
> /*
> * Make sure all outstanding writes are posted to the backing device.
> */
> - virtio_submit_multiwrite(req->dev->blk, mrb);
> + if (mrb->is_write && mrb->num_reqs > 0) {
> + virtio_submit_multireq(req->dev->blk, mrb);
> + }
> blk_aio_flush(req->dev->blk, virtio_blk_flush_complete, req);
> }
>
> @@ -329,6 +442,9 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev,
> uint64_t nb_sectors = size >> BDRV_SECTOR_BITS;
> uint64_t total_sectors;
>
> + if (nb_sectors > INT_MAX) {
> + return false;
> + }
> if (sector & dev->sector_mask) {
> return false;
> }
> @@ -342,60 +458,6 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev,
> return true;
> }
>
> -static void virtio_blk_handle_write(VirtIOBlockReq *req, MultiReqBuffer *mrb)
> -{
> - BlockRequest *blkreq;
> - uint64_t sector;
> -
> - sector = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector);
> -
> - trace_virtio_blk_handle_write(req, sector, req->qiov.size / 512);
> -
> - if (!virtio_blk_sect_range_ok(req->dev, sector, req->qiov.size)) {
> - virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
> - virtio_blk_free_request(req);
> - return;
> - }
> -
> - block_acct_start(blk_get_stats(req->dev->blk), &req->acct, req->qiov.size,
> - BLOCK_ACCT_WRITE);
> -
> - if (mrb->num_writes == VIRTIO_BLK_MAX_MERGE_REQS) {
> - virtio_submit_multiwrite(req->dev->blk, mrb);
> - }
> -
> - blkreq = &mrb->blkreq[mrb->num_writes];
> - blkreq->sector = sector;
> - blkreq->nb_sectors = req->qiov.size / BDRV_SECTOR_SIZE;
> - blkreq->qiov = &req->qiov;
> - blkreq->cb = virtio_blk_rw_complete;
> - blkreq->opaque = req;
> - blkreq->error = 0;
> -
> - mrb->num_writes++;
> -}
> -
> -static void virtio_blk_handle_read(VirtIOBlockReq *req)
> -{
> - uint64_t sector;
> -
> - sector = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector);
> -
> - trace_virtio_blk_handle_read(req, sector, req->qiov.size / 512);
> -
> - if (!virtio_blk_sect_range_ok(req->dev, sector, req->qiov.size)) {
> - virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
> - virtio_blk_free_request(req);
> - return;
> - }
> -
> - block_acct_start(blk_get_stats(req->dev->blk), &req->acct, req->qiov.size,
> - BLOCK_ACCT_READ);
> - blk_aio_readv(req->dev->blk, sector, &req->qiov,
> - req->qiov.size / BDRV_SECTOR_SIZE,
> - virtio_blk_rw_complete, req);
> -}
> -
> void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
> {
> uint32_t type;
> @@ -430,11 +492,54 @@ void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>
> type = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
>
> - if (type & VIRTIO_BLK_T_FLUSH) {
> + switch (type & 0xff) {
Somehow I'd prefer type & ~VIRTIO_BLK_T_BARRIER, but this is correct, too.
> + case VIRTIO_BLK_T_IN:
> + case VIRTIO_BLK_T_OUT:
> + {
> + bool is_write = type & VIRTIO_BLK_T_OUT;
> + req->sector_num = virtio_ldq_p(VIRTIO_DEVICE(req->dev),
> + &req->out.sector);
> +
> + if (!virtio_blk_sect_range_ok(req->dev, req->sector_num,
> + req->qiov.size)) {
> + virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
> + virtio_blk_free_request(req);
> + return;
> + }
> +
> + if (is_write) {
> + qemu_iovec_init_external(&req->qiov, iov, out_num);
> + trace_virtio_blk_handle_write(req, req->sector_num,
> + req->qiov.size / BDRV_SECTOR_SIZE);
> + } else {
> + qemu_iovec_init_external(&req->qiov, in_iov, in_num);
> + trace_virtio_blk_handle_read(req, req->sector_num,
> + req->qiov.size / BDRV_SECTOR_SIZE);
> + }
Before this patch, req->qiov (and subsequently req->qiov.size) was
initialized before virtio_blk_sect_range_ok() was called (with
req->qiov.size as an argument). Is it fine to change that?
> +
> + block_acct_start(blk_get_stats(req->dev->blk),
> + &req->acct, req->qiov.size,
> + is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
> +
> + /* merge would exceed maximum number of requests or IO direction
> + * changes */
> + if (mrb->num_reqs > 0 && (mrb->num_reqs == VIRTIO_BLK_MAX_MERGE_REQS ||
> + is_write != mrb->is_write)) {
> + virtio_submit_multireq(req->dev->blk, mrb);
> + }
> +
> + mrb->reqs[mrb->num_reqs++] = req;
Somehow I'd like an assert(mrb->num_reqs < VIRTIO_BLK_MAX_MERGE_REQS)
before this. Feel free to ignore my request.
> + mrb->is_write = is_write;
> + break;
> + }
> + case VIRTIO_BLK_T_FLUSH:
I think VIRTIO_BLK_T_FLUSH_OUT is missing (that's the reason it was
"type & VIRTIO_BLK_T_..." before instead of "type == VIRTIO_BLK_T_...").
> virtio_blk_handle_flush(req, mrb);
> - } else if (type & VIRTIO_BLK_T_SCSI_CMD) {
> + break;
> + case VIRTIO_BLK_T_SCSI_CMD:
And VIRTIO_BLK_T_SCSI_CMD_OUT here.
The overall logic of this patch seems good to me (although I had some
trouble following which QIOVs had which purpose, which were allocated
and which were not, and so on...).
Max
> virtio_blk_handle_scsi(req);
> - } else if (type & VIRTIO_BLK_T_GET_ID) {
> + break;
> + case VIRTIO_BLK_T_GET_ID:
> + {
> VirtIOBlock *s = req->dev;
>
> /*
> @@ -448,14 +553,9 @@ void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
> iov_from_buf(in_iov, in_num, 0, serial, size);
> virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
> virtio_blk_free_request(req);
> - } else if (type & VIRTIO_BLK_T_OUT) {
> - qemu_iovec_init_external(&req->qiov, iov, out_num);
> - virtio_blk_handle_write(req, mrb);
> - } else if (type == VIRTIO_BLK_T_IN || type == VIRTIO_BLK_T_BARRIER) {
> - /* VIRTIO_BLK_T_IN is 0, so we can't just & it. */
> - qemu_iovec_init_external(&req->qiov, in_iov, in_num);
> - virtio_blk_handle_read(req);
> - } else {
> + break;
> + }
> + default:
> virtio_blk_req_complete(req, VIRTIO_BLK_S_UNSUPP);
> virtio_blk_free_request(req);
> }
> @@ -465,9 +565,7 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
> {
> VirtIOBlock *s = VIRTIO_BLK(vdev);
> VirtIOBlockReq *req;
> - MultiReqBuffer mrb = {
> - .num_writes = 0,
> - };
> + MultiReqBuffer mrb = {};
>
> /* Some guests kick before setting VIRTIO_CONFIG_S_DRIVER_OK so start
> * dataplane here instead of waiting for .set_status().
> @@ -481,7 +579,9 @@ static void virtio_blk_handle_output(VirtIODevice *vdev, VirtQueue *vq)
> virtio_blk_handle_request(req, &mrb);
> }
>
> - virtio_submit_multiwrite(s->blk, &mrb);
> + if (mrb.num_reqs) {
> + virtio_submit_multireq(s->blk, &mrb);
> + }
>
> /*
> * FIXME: Want to check for completions before returning to guest mode,
> @@ -494,9 +594,7 @@ static void virtio_blk_dma_restart_bh(void *opaque)
> {
> VirtIOBlock *s = opaque;
> VirtIOBlockReq *req = s->rq;
> - MultiReqBuffer mrb = {
> - .num_writes = 0,
> - };
> + MultiReqBuffer mrb = {};
>
> qemu_bh_delete(s->bh);
> s->bh = NULL;
> @@ -509,7 +607,9 @@ static void virtio_blk_dma_restart_bh(void *opaque)
> req = next;
> }
>
> - virtio_submit_multiwrite(s->blk, &mrb);
> + if (mrb.num_reqs) {
> + virtio_submit_multireq(s->blk, &mrb);
> + }
> }
>
> static void virtio_blk_dma_restart_cb(void *opaque, int running,
> diff --git a/include/hw/virtio/virtio-blk.h b/include/hw/virtio/virtio-blk.h
> index 6ef3fa5..2feeb57 100644
> --- a/include/hw/virtio/virtio-blk.h
> +++ b/include/hw/virtio/virtio-blk.h
> @@ -134,29 +134,32 @@ typedef struct VirtIOBlock {
> struct VirtIOBlockDataPlane *dataplane;
> } VirtIOBlock;
>
> -#define VIRTIO_BLK_MAX_MERGE_REQS 32
> -
> -typedef struct MultiReqBuffer {
> - BlockRequest blkreq[VIRTIO_BLK_MAX_MERGE_REQS];
> - unsigned int num_writes;
> -} MultiReqBuffer;
> -
> typedef struct VirtIOBlockReq {
> + int64_t sector_num;
> VirtIOBlock *dev;
> VirtQueueElement elem;
> struct virtio_blk_inhdr *in;
> struct virtio_blk_outhdr out;
> QEMUIOVector qiov;
> struct VirtIOBlockReq *next;
> + struct VirtIOBlockReq *mr_next;
> BlockAcctCookie acct;
> } VirtIOBlockReq;
>
> +#define VIRTIO_BLK_MAX_MERGE_REQS 32
> +
> +typedef struct MultiReqBuffer {
> + VirtIOBlockReq *reqs[VIRTIO_BLK_MAX_MERGE_REQS];
> + unsigned int num_reqs;
> + bool is_write;
> +} MultiReqBuffer;
> +
> VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s);
>
> void virtio_blk_free_request(VirtIOBlockReq *req);
>
> void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb);
>
> -void virtio_submit_multiwrite(BlockBackend *blk, MultiReqBuffer *mrb);
> +void virtio_submit_multireq(BlockBackend *blk, MultiReqBuffer *mrb);
>
> #endif
> diff --git a/trace-events b/trace-events
> index 4ec81eb..9ef3b68 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -116,6 +116,7 @@ virtio_blk_req_complete(void *req, int status) "req %p status %d"
> virtio_blk_rw_complete(void *req, int ret) "req %p ret %d"
> virtio_blk_handle_write(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
> virtio_blk_handle_read(void *req, uint64_t sector, size_t nsectors) "req %p sector %"PRIu64" nsectors %zu"
> +virtio_blk_submit_multireq(void *mrb, int start, int num_reqs, uint64_t sector, size_t nsectors, bool is_write) "mrb %p start %d num_reqs %d sector %"PRIu64" nsectors %zu is_write %d"
>
> # hw/block/dataplane/virtio-blk.c
> virtio_blk_data_plane_start(void *s) "dataplane %p"
* Re: [Qemu-devel] [PATCH v3 4/4] virtio-blk: introduce multiread
2015-01-30 17:16 ` Max Reitz
@ 2015-01-30 21:05 ` Peter Lieven
2015-01-30 21:15 ` Kevin Wolf
2015-01-30 21:30 ` Max Reitz
0 siblings, 2 replies; 14+ messages in thread
From: Peter Lieven @ 2015-01-30 21:05 UTC (permalink / raw)
To: Max Reitz, qemu-devel
Cc: kwolf, famz, benoit, ming.lei, armbru, stefanha, pbonzini
On 30.01.2015 at 18:16, Max Reitz wrote:
> On 2015-01-30 at 09:33, Peter Lieven wrote:
>> this patch finally introduces multiread support to virtio-blk. While
>> multiwrite support was there for a long time, read support was missing.
>>
>> The complete merge logic is moved into virtio-blk.c which has
>> been the only user of request merging ever since. This is required
>> to be able to merge chunks of requests and immediately invoke callbacks
>> for those requests. Secondly, this is required to switch to
>> direct invocation of coroutines which is planned at a later stage.
>>
>> The following benchmarks show the performance of running fio with
>> 4 worker threads on a local ram disk. The numbers show the average
>> of 10 test runs after 1 run as warmup phase.
>>
>> | 4k | 64k | 4k
>> MB/s | rd seq | rd rand | rd seq | rd rand | wr seq | wr rand
>> --------------+--------+---------+--------+---------+--------+--------
>> master | 1221 | 1187 | 4178 | 4114 | 1745 | 1213
>> multiread | 1829 | 1189 | 4639 | 4110 | 1894 | 1216
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> hw/block/dataplane/virtio-blk.c | 8 +-
>> hw/block/virtio-blk.c | 288 +++++++++++++++++++++++++++-------------
>> include/hw/virtio/virtio-blk.h | 19 +--
>> trace-events | 1 +
>> 4 files changed, 210 insertions(+), 106 deletions(-)
>>
>> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
>> index 39c5d71..93472e9 100644
>> --- a/hw/block/dataplane/virtio-blk.c
>> +++ b/hw/block/dataplane/virtio-blk.c
>> @@ -96,9 +96,7 @@ static void handle_notify(EventNotifier *e)
>> event_notifier_test_and_clear(&s->host_notifier);
>> blk_io_plug(s->conf->conf.blk);
>> for (;;) {
>> - MultiReqBuffer mrb = {
>> - .num_writes = 0,
>> - };
>> + MultiReqBuffer mrb = {};
>> int ret;
>> /* Disable guest->host notifies to avoid unnecessary vmexits */
>> @@ -120,7 +118,9 @@ static void handle_notify(EventNotifier *e)
>> virtio_blk_handle_request(req, &mrb);
>> }
>> - virtio_submit_multiwrite(s->conf->conf.blk, &mrb);
>> + if (mrb.num_reqs) {
>> + virtio_submit_multireq(s->conf->conf.blk, &mrb);
>> + }
>> if (likely(ret == -EAGAIN)) { /* vring emptied */
>> /* Re-enable guest->host notifies and stop processing the vring.
>> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
>> index e04adb8..77890a0 100644
>> --- a/hw/block/virtio-blk.c
>> +++ b/hw/block/virtio-blk.c
>> @@ -34,6 +34,7 @@ VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
>> req->dev = s;
>> req->qiov.size = 0;
>> req->next = NULL;
>> + req->mr_next = NULL;
>> return req;
>> }
>> @@ -84,20 +85,29 @@ static int virtio_blk_handle_rw_error(VirtIOBlockReq *req, int error,
>> static void virtio_blk_rw_complete(void *opaque, int ret)
>> {
>> - VirtIOBlockReq *req = opaque;
>> + VirtIOBlockReq *next = opaque;
>> - trace_virtio_blk_rw_complete(req, ret);
>> + while (next) {
>> + VirtIOBlockReq *req = next;
>> + next = req->mr_next;
>> + trace_virtio_blk_rw_complete(req, ret);
>> - if (ret) {
>> - int p = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
>> - bool is_read = !(p & VIRTIO_BLK_T_OUT);
>> - if (virtio_blk_handle_rw_error(req, -ret, is_read))
>> - return;
>> - }
>> + if (req->qiov.nalloc != -1) {
>> + qemu_iovec_destroy(&req->qiov);
>> + }
>> - virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
>> - block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
>> - virtio_blk_free_request(req);
>> + if (ret) {
>> + int p = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
>> + bool is_read = !(p & VIRTIO_BLK_T_OUT);
>> + if (virtio_blk_handle_rw_error(req, -ret, is_read)) {
>> + continue;
>> + }
>> + }
>> +
>> + virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
>> + block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
>> + virtio_blk_free_request(req);
>> + }
>> }
>> static void virtio_blk_flush_complete(void *opaque, int ret)
>> @@ -291,24 +301,125 @@ static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
>> }
>> }
>> -void virtio_submit_multiwrite(BlockBackend *blk, MultiReqBuffer *mrb)
>> +static inline void
>
> Is the "inline" really worth it? This function is rather long...
I tried really hard not to add a regression for random reads, and the difference
between good and bad here is sometimes just a matter of a few small changes.
If gcc inlines it on its own anyway, would you mind keeping the inline? The function
is actually only called from 2 places.
>
> (To my surprise, gcc actually does inline the function)
>
>> +virtio_submit_multireq2(BlockBackend *blk, MultiReqBuffer *mrb,
>
> Hm, that's not a very descriptive name... Maybe "submit_merged_requests" or something like that (it's static, so you can drop the virtio_ prefix if you want to) would be better?
Sure, you are right.
>
>> + int start, int num_reqs, int niov)
>> {
>> - int i, ret;
>> + QEMUIOVector *qiov = &mrb->reqs[start]->qiov;
>> + int64_t sector_num = mrb->reqs[start]->sector_num;
>> + int nb_sectors = mrb->reqs[start]->qiov.size / BDRV_SECTOR_SIZE;
>> + bool is_write = mrb->is_write;
>> +
>> + if (num_reqs > 1) {
>> + int i;
>> + struct iovec *_iov = qiov->iov;
>> + int _niov = qiov->niov;
>
> Identifiers with leading underscores are considered reserved, I'd rather avoid using them.
Okay. I will use tmp_iov and tmp_niov instead.
>
>> +
>> + qemu_iovec_init(qiov, niov);
>> +
>> + for (i = 0; i < _niov; i++) {
>> + qemu_iovec_add(qiov, _iov[i].iov_base, _iov[i].iov_len);
>> + }
>> - if (!mrb->num_writes) {
>> + for (i = start + 1; i < start + num_reqs; i++) {
>> + qemu_iovec_concat(qiov, &mrb->reqs[i]->qiov, 0,
>> + mrb->reqs[i]->qiov.size);
>> + mrb->reqs[i - 1]->mr_next = mrb->reqs[i];
>> + nb_sectors += mrb->reqs[i]->qiov.size / BDRV_SECTOR_SIZE;
>> + }
>> + assert(nb_sectors == qiov->size / BDRV_SECTOR_SIZE);
>> +
>> + trace_virtio_blk_submit_multireq(mrb, start, num_reqs, sector_num,
>> + nb_sectors, is_write);
>> + block_acct_merge_done(blk_get_stats(blk),
>> + is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ,
>> + num_reqs - 1);
>> + }
>> +
>> + if (is_write) {
>> + blk_aio_writev(blk, sector_num, qiov, nb_sectors,
>> + virtio_blk_rw_complete, mrb->reqs[start]);
>> + } else {
>> + blk_aio_readv(blk, sector_num, qiov, nb_sectors,
>> + virtio_blk_rw_complete, mrb->reqs[start]);
>> + }
>> +}
>> +
>> +static int virtio_multireq_compare(const void *a, const void *b)
>> +{
>> + const VirtIOBlockReq *req1 = *(VirtIOBlockReq **)a,
>> + *req2 = *(VirtIOBlockReq **)b;
>> +
>> + /*
>> + * Note that we can't simply subtract sector_num1 from sector_num2
>> + * here as that could overflow the return value.
>> + */
>> + if (req1->sector_num > req2->sector_num) {
>> + return 1;
>> + } else if (req1->sector_num < req2->sector_num) {
>> + return -1;
>> + } else {
>> + return 0;
>> + }
>> +}
>> +
>> +void virtio_submit_multireq(BlockBackend *blk, MultiReqBuffer *mrb)
>> +{
>> + int i = 0, start = 0, num_reqs = 0, niov = 0, nb_sectors = 0,
>> + max_xfer_len = 0;
>
> Albeit not necessary, starting a new "int" line won't take any more space (and would look nicer to me, probably).
okay, I don't mind.
>
>> + int64_t sector_num = 0;
>> +
>> + if (mrb->num_reqs == 1) {
>> + virtio_submit_multireq2(blk, mrb, 0, 1, -1);
>> + mrb->num_reqs = 0;
>> return;
>> }
>> - ret = blk_aio_multiwrite(blk, mrb->blkreq, mrb->num_writes);
>> - if (ret != 0) {
>> - for (i = 0; i < mrb->num_writes; i++) {
>> - if (mrb->blkreq[i].error) {
>> - virtio_blk_rw_complete(mrb->blkreq[i].opaque, -EIO);
>> + max_xfer_len = blk_get_max_transfer_length(mrb->reqs[0]->dev->blk);
>> + max_xfer_len = MIN_NON_ZERO(max_xfer_len, INT_MAX);
>> +
>> + qsort(mrb->reqs, mrb->num_reqs, sizeof(*mrb->reqs),
>> + &virtio_multireq_compare);
>> +
>> + for (i = 0; i < mrb->num_reqs; i++) {
>> + VirtIOBlockReq *req = mrb->reqs[i];
>> + if (num_reqs > 0) {
>> + bool merge = true;
>> +
>> + /* merge would exceed maximum number of IOVs */
>> + if (niov + req->qiov.niov + 1 > IOV_MAX) {
>
> Hm, why the +1?
A really good question. I copied this piece from the old merge routine. It seems
definitely wrong.
>
>> + merge = false;
>> + }
>> +
>> + /* merge would exceed maximum transfer length of backend device */
>> + if (req->qiov.size / BDRV_SECTOR_SIZE + nb_sectors > max_xfer_len) {
>> + merge = false;
>> + }
>> +
>> + /* requests are not sequential */
>> + if (sector_num + nb_sectors != req->sector_num) {
>> + merge = false;
>> + }
>> +
>> + if (!merge) {
>> + virtio_submit_multireq2(blk, mrb, start, num_reqs, niov);
>> + num_reqs = 0;
>> }
>> }
>> +
>> + if (num_reqs == 0) {
>> + sector_num = req->sector_num;
>> + nb_sectors = niov = 0;
>> + start = i;
>> + }
>> +
>> + nb_sectors += req->qiov.size / BDRV_SECTOR_SIZE;
>> + niov += req->qiov.niov;
>> + num_reqs++;
>> }
>> - mrb->num_writes = 0;
>> + virtio_submit_multireq2(blk, mrb, start, num_reqs, niov);
>> + mrb->num_reqs = 0;
>> }
>> static void virtio_blk_handle_flush(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>> @@ -319,7 +430,9 @@ static void virtio_blk_handle_flush(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>> /*
>> * Make sure all outstanding writes are posted to the backing device.
>> */
>> - virtio_submit_multiwrite(req->dev->blk, mrb);
>> + if (mrb->is_write && mrb->num_reqs > 0) {
>> + virtio_submit_multireq(req->dev->blk, mrb);
>> + }
>> blk_aio_flush(req->dev->blk, virtio_blk_flush_complete, req);
>> }
>> @@ -329,6 +442,9 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev,
>> uint64_t nb_sectors = size >> BDRV_SECTOR_BITS;
>> uint64_t total_sectors;
>> + if (nb_sectors > INT_MAX) {
>> + return false;
>> + }
>> if (sector & dev->sector_mask) {
>> return false;
>> }
>> @@ -342,60 +458,6 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev,
>> return true;
>> }
>> -static void virtio_blk_handle_write(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>> -{
>> - BlockRequest *blkreq;
>> - uint64_t sector;
>> -
>> - sector = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector);
>> -
>> - trace_virtio_blk_handle_write(req, sector, req->qiov.size / 512);
>> -
>> - if (!virtio_blk_sect_range_ok(req->dev, sector, req->qiov.size)) {
>> - virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
>> - virtio_blk_free_request(req);
>> - return;
>> - }
>> -
>> - block_acct_start(blk_get_stats(req->dev->blk), &req->acct, req->qiov.size,
>> - BLOCK_ACCT_WRITE);
>> -
>> - if (mrb->num_writes == VIRTIO_BLK_MAX_MERGE_REQS) {
>> - virtio_submit_multiwrite(req->dev->blk, mrb);
>> - }
>> -
>> - blkreq = &mrb->blkreq[mrb->num_writes];
>> - blkreq->sector = sector;
>> - blkreq->nb_sectors = req->qiov.size / BDRV_SECTOR_SIZE;
>> - blkreq->qiov = &req->qiov;
>> - blkreq->cb = virtio_blk_rw_complete;
>> - blkreq->opaque = req;
>> - blkreq->error = 0;
>> -
>> - mrb->num_writes++;
>> -}
>> -
>> -static void virtio_blk_handle_read(VirtIOBlockReq *req)
>> -{
>> - uint64_t sector;
>> -
>> - sector = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector);
>> -
>> - trace_virtio_blk_handle_read(req, sector, req->qiov.size / 512);
>> -
>> - if (!virtio_blk_sect_range_ok(req->dev, sector, req->qiov.size)) {
>> - virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
>> - virtio_blk_free_request(req);
>> - return;
>> - }
>> -
>> - block_acct_start(blk_get_stats(req->dev->blk), &req->acct, req->qiov.size,
>> - BLOCK_ACCT_READ);
>> - blk_aio_readv(req->dev->blk, sector, &req->qiov,
>> - req->qiov.size / BDRV_SECTOR_SIZE,
>> - virtio_blk_rw_complete, req);
>> -}
>> -
>> void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>> {
>> uint32_t type;
>> @@ -430,11 +492,54 @@ void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>> type = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
>> - if (type & VIRTIO_BLK_T_FLUSH) {
>> + switch (type & 0xff) {
>
> Somehow I'd prefer type & ~VIRTIO_BLK_T_BARRIER, but this is correct, too.
The problem is that type encodes operations and flags, and there is no official mask
defined in the specs to distinguish them. If a new flag is ever defined it will break the switch statement. I could add a macro
for the 0xff in the headers if you like?
>
>> + case VIRTIO_BLK_T_IN:
>> + case VIRTIO_BLK_T_OUT:
>> + {
>> + bool is_write = type & VIRTIO_BLK_T_OUT;
>> + req->sector_num = virtio_ldq_p(VIRTIO_DEVICE(req->dev),
>> + &req->out.sector);
>> +
>> + if (!virtio_blk_sect_range_ok(req->dev, req->sector_num,
>> + req->qiov.size)) {
>> + virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
>> + virtio_blk_free_request(req);
>> + return;
>> + }
>> +
>> + if (is_write) {
>> + qemu_iovec_init_external(&req->qiov, iov, out_num);
>> + trace_virtio_blk_handle_write(req, req->sector_num,
>> + req->qiov.size / BDRV_SECTOR_SIZE);
>> + } else {
>> + qemu_iovec_init_external(&req->qiov, in_iov, in_num);
>> + trace_virtio_blk_handle_read(req, req->sector_num,
>> + req->qiov.size / BDRV_SECTOR_SIZE);
>> + }
>
> Before this patch, req->qiov (and subsequently req->qiov.size) was initialized before virtio_blk_sect_range_ok() was called (with req->qiov.size as an argument). Is it fine to change that?
Oops, that's definitely a mistake. Half of the checks in virtio_blk_sect_range_ok won't work when the size isn't set yet.
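One way to fix it would be to initialize the qiov from the descriptor chain first and only run the range check afterwards, roughly like this (just a sketch, not the final patch):
    if (is_write) {
        qemu_iovec_init_external(&req->qiov, iov, out_num);
    } else {
        qemu_iovec_init_external(&req->qiov, in_iov, in_num);
    }
    /* req->qiov.size is valid here, so the range check sees the real size */
    if (!virtio_blk_sect_range_ok(req->dev, req->sector_num, req->qiov.size)) {
        virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
        virtio_blk_free_request(req);
        return;
    }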
>
>> +
>> + block_acct_start(blk_get_stats(req->dev->blk),
>> + &req->acct, req->qiov.size,
>> + is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
>> +
>> + /* merge would exceed maximum number of requests or IO direction
>> + * changes */
>> + if (mrb->num_reqs > 0 && (mrb->num_reqs == VIRTIO_BLK_MAX_MERGE_REQS ||
>> + is_write != mrb->is_write)) {
>> + virtio_submit_multireq(req->dev->blk, mrb);
>> + }
>> +
>> + mrb->reqs[mrb->num_reqs++] = req;
>
> Somehow I'd like an assert(mrb->num_reqs < VIRTIO_BLK_MAX_MERGE_REQS) before this. Feel free to ignore my request.
This can't happen and the check for this is right above. What scenario do you think of that could overflow num_reqs?
>
>> + mrb->is_write = is_write;
>> + break;
>> + }
>> + case VIRTIO_BLK_T_FLUSH:
>
> I think VIRTIO_BLK_T_FLUSH_OUT is missing (that's the reason it was "type & VIRTIO_BLK_T_..." before instead of "type == VIRTIO_BLK_T_...").
VIRTIO_BLK_T_FLUSH_OUT is deprecated and not defined in the headers of qemu anymore. I can add it, but then I would also add VIRTIO_BLK_T_SCSI_CMD_OUT.
>
>> virtio_blk_handle_flush(req, mrb);
>> - } else if (type & VIRTIO_BLK_T_SCSI_CMD) {
>> + break;
>> + case VIRTIO_BLK_T_SCSI_CMD:
>
> And VIRTIO_BLK_T_SCSI_CMD_OUT here.
See above. I will define them in the headers.
>
>
> The overall logic of this patch seems good to me (although I had some trouble following which QIOVs had which purpose, which were allocated and which were not, and so on...).
If you tell me at which point it was most difficult I can try to add comments.
Thanks for reviewing,
Peter
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH v3 4/4] virtio-blk: introduce multiread
2015-01-30 21:05 ` Peter Lieven
@ 2015-01-30 21:15 ` Kevin Wolf
2015-01-31 19:46 ` Peter Lieven
2015-01-30 21:30 ` Max Reitz
1 sibling, 1 reply; 14+ messages in thread
From: Kevin Wolf @ 2015-01-30 21:15 UTC (permalink / raw)
To: Peter Lieven
Cc: famz, benoit, ming.lei, armbru, qemu-devel, stefanha, pbonzini,
Max Reitz
Am 30.01.2015 um 22:05 hat Peter Lieven geschrieben:
> Am 30.01.2015 um 18:16 schrieb Max Reitz:
> > On 2015-01-30 at 09:33, Peter Lieven wrote:
> >> this patch finally introduces multiread support to virtio-blk. While
> >> multiwrite support was there for a long time, read support was missing.
> >>
> >> The complete merge logic is moved into virtio-blk.c which has
> >> been the only user of request merging ever since. This is required
> >> to be able to merge chunks of requests and immediately invoke callbacks
> >> for those requests. Secondly, this is required to switch to
> >> direct invocation of coroutines which is planned at a later stage.
> >>
> >> The following benchmarks show the performance of running fio with
> >> 4 worker threads on a local ram disk. The numbers show the average
> >> of 10 test runs after 1 run as warmup phase.
> >>
> >> | 4k | 64k | 4k
> >> MB/s | rd seq | rd rand | rd seq | rd rand | wr seq | wr rand
> >> --------------+--------+---------+--------+---------+--------+--------
> >> master | 1221 | 1187 | 4178 | 4114 | 1745 | 1213
> >> multiread | 1829 | 1189 | 4639 | 4110 | 1894 | 1216
> >>
> >> Signed-off-by: Peter Lieven <pl@kamp.de>
> >> ---
> >> hw/block/dataplane/virtio-blk.c | 8 +-
> >> hw/block/virtio-blk.c | 288 +++++++++++++++++++++++++++-------------
> >> include/hw/virtio/virtio-blk.h | 19 +--
> >> trace-events | 1 +
> >> 4 files changed, 210 insertions(+), 106 deletions(-)
> >> + int64_t sector_num = 0;
> >> +
> >> + if (mrb->num_reqs == 1) {
> >> + virtio_submit_multireq2(blk, mrb, 0, 1, -1);
> >> + mrb->num_reqs = 0;
> >> return;
> >> }
> >> - ret = blk_aio_multiwrite(blk, mrb->blkreq, mrb->num_writes);
> >> - if (ret != 0) {
> >> - for (i = 0; i < mrb->num_writes; i++) {
> >> - if (mrb->blkreq[i].error) {
> >> - virtio_blk_rw_complete(mrb->blkreq[i].opaque, -EIO);
> >> + max_xfer_len = blk_get_max_transfer_length(mrb->reqs[0]->dev->blk);
> >> + max_xfer_len = MIN_NON_ZERO(max_xfer_len, INT_MAX);
> >> +
> >> + qsort(mrb->reqs, mrb->num_reqs, sizeof(*mrb->reqs),
> >> + &virtio_multireq_compare);
> >> +
> >> + for (i = 0; i < mrb->num_reqs; i++) {
> >> + VirtIOBlockReq *req = mrb->reqs[i];
> >> + if (num_reqs > 0) {
> >> + bool merge = true;
> >> +
> >> + /* merge would exceed maximum number of IOVs */
> >> + if (niov + req->qiov.niov + 1 > IOV_MAX) {
> >
> > Hm, why the +1?
>
> A really good question. I copied this piece from the old merge routine. It seems
> definitely wrong.
The old code merged requests even if they were overlapping. This could
result in one area being split in two.
I think you don't support this here, so removing the + 1 is probably
okay.
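Without overlap splitting the merged request needs exactly the sum of the per-request iovec counts, so the limit check can simply be (sketch):
    /* merge would exceed the maximum number of iovec entries */
    if (niov + req->qiov.niov > IOV_MAX) {
        merge = false;
    }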
Kevin
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH v3 4/4] virtio-blk: introduce multiread
2015-01-30 21:05 ` Peter Lieven
2015-01-30 21:15 ` Kevin Wolf
@ 2015-01-30 21:30 ` Max Reitz
2015-01-31 19:45 ` Peter Lieven
1 sibling, 1 reply; 14+ messages in thread
From: Max Reitz @ 2015-01-30 21:30 UTC (permalink / raw)
To: Peter Lieven, qemu-devel
Cc: kwolf, famz, benoit, ming.lei, armbru, stefanha, pbonzini
On 2015-01-30 at 16:05, Peter Lieven wrote:
> Am 30.01.2015 um 18:16 schrieb Max Reitz:
>> On 2015-01-30 at 09:33, Peter Lieven wrote:
>>> this patch finally introduces multiread support to virtio-blk. While
>>> multiwrite support was there for a long time, read support was missing.
>>>
>>> The complete merge logic is moved into virtio-blk.c which has
>>> been the only user of request merging ever since. This is required
>>> to be able to merge chunks of requests and immediately invoke callbacks
>>> for those requests. Secondly, this is required to switch to
>>> direct invocation of coroutines which is planned at a later stage.
>>>
>>> The following benchmarks show the performance of running fio with
>>> 4 worker threads on a local ram disk. The numbers show the average
>>> of 10 test runs after 1 run as warmup phase.
>>>
>>> | 4k | 64k | 4k
>>> MB/s | rd seq | rd rand | rd seq | rd rand | wr seq | wr rand
>>> --------------+--------+---------+--------+---------+--------+--------
>>> master | 1221 | 1187 | 4178 | 4114 | 1745 | 1213
>>> multiread | 1829 | 1189 | 4639 | 4110 | 1894 | 1216
>>>
>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>> ---
>>> hw/block/dataplane/virtio-blk.c | 8 +-
>>> hw/block/virtio-blk.c | 288 +++++++++++++++++++++++++++-------------
>>> include/hw/virtio/virtio-blk.h | 19 +--
>>> trace-events | 1 +
>>> 4 files changed, 210 insertions(+), 106 deletions(-)
>>>
>>> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
>>> index 39c5d71..93472e9 100644
>>> --- a/hw/block/dataplane/virtio-blk.c
>>> +++ b/hw/block/dataplane/virtio-blk.c
>>> @@ -96,9 +96,7 @@ static void handle_notify(EventNotifier *e)
>>> event_notifier_test_and_clear(&s->host_notifier);
>>> blk_io_plug(s->conf->conf.blk);
>>> for (;;) {
>>> - MultiReqBuffer mrb = {
>>> - .num_writes = 0,
>>> - };
>>> + MultiReqBuffer mrb = {};
>>> int ret;
>>> /* Disable guest->host notifies to avoid unnecessary vmexits */
>>> @@ -120,7 +118,9 @@ static void handle_notify(EventNotifier *e)
>>> virtio_blk_handle_request(req, &mrb);
>>> }
>>> - virtio_submit_multiwrite(s->conf->conf.blk, &mrb);
>>> + if (mrb.num_reqs) {
>>> + virtio_submit_multireq(s->conf->conf.blk, &mrb);
>>> + }
>>> if (likely(ret == -EAGAIN)) { /* vring emptied */
>>> /* Re-enable guest->host notifies and stop processing the vring.
>>> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
>>> index e04adb8..77890a0 100644
>>> --- a/hw/block/virtio-blk.c
>>> +++ b/hw/block/virtio-blk.c
>>> @@ -34,6 +34,7 @@ VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
>>> req->dev = s;
>>> req->qiov.size = 0;
>>> req->next = NULL;
>>> + req->mr_next = NULL;
>>> return req;
>>> }
>>> @@ -84,20 +85,29 @@ static int virtio_blk_handle_rw_error(VirtIOBlockReq *req, int error,
>>> static void virtio_blk_rw_complete(void *opaque, int ret)
>>> {
>>> - VirtIOBlockReq *req = opaque;
>>> + VirtIOBlockReq *next = opaque;
>>> - trace_virtio_blk_rw_complete(req, ret);
>>> + while (next) {
>>> + VirtIOBlockReq *req = next;
>>> + next = req->mr_next;
>>> + trace_virtio_blk_rw_complete(req, ret);
>>> - if (ret) {
>>> - int p = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
>>> - bool is_read = !(p & VIRTIO_BLK_T_OUT);
>>> - if (virtio_blk_handle_rw_error(req, -ret, is_read))
>>> - return;
>>> - }
>>> + if (req->qiov.nalloc != -1) {
>>> + qemu_iovec_destroy(&req->qiov);
>>> + }
>>> - virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
>>> - block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
>>> - virtio_blk_free_request(req);
>>> + if (ret) {
>>> + int p = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
>>> + bool is_read = !(p & VIRTIO_BLK_T_OUT);
>>> + if (virtio_blk_handle_rw_error(req, -ret, is_read)) {
>>> + continue;
>>> + }
>>> + }
>>> +
>>> + virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
>>> + block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
>>> + virtio_blk_free_request(req);
>>> + }
>>> }
>>> static void virtio_blk_flush_complete(void *opaque, int ret)
>>> @@ -291,24 +301,125 @@ static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
>>> }
>>> }
>>> -void virtio_submit_multiwrite(BlockBackend *blk, MultiReqBuffer *mrb)
>>> +static inline void
>> Is the "inline" really worth it? This function is rather long...
> I tried really hard not to add a regression for random reads, and the difference
> between good and bad here is sometimes just a matter of a few small changes.
> If gcc inlines on its own anyway, would you mind keeping the inline? The function
> is actually only called from 2 places.
Okay, feel free to keep it.
>> (To my surprise, gcc actually does inline the function)
>>
>>> +virtio_submit_multireq2(BlockBackend *blk, MultiReqBuffer *mrb,
>> Hm, that's not a very descriptive name... Maybe "submit_merged_requests" or something like that (it's static, so you can drop the virtio_ prefix if you want to) would be better?
> Sure, you are right.
>
>>> + int start, int num_reqs, int niov)
>>> {
>>> - int i, ret;
>>> + QEMUIOVector *qiov = &mrb->reqs[start]->qiov;
>>> + int64_t sector_num = mrb->reqs[start]->sector_num;
>>> + int nb_sectors = mrb->reqs[start]->qiov.size / BDRV_SECTOR_SIZE;
>>> + bool is_write = mrb->is_write;
>>> +
>>> + if (num_reqs > 1) {
>>> + int i;
>>> + struct iovec *_iov = qiov->iov;
>>> + int _niov = qiov->niov;
>> Identifiers with leading underscores are considered reserved, I'd rather avoid using them.
> Okay. I will use tmp_iov and tmp_niov instead.
>
>>> +
>>> + qemu_iovec_init(qiov, niov);
>>> +
>>> + for (i = 0; i < _niov; i++) {
>>> + qemu_iovec_add(qiov, _iov[i].iov_base, _iov[i].iov_len);
>>> + }
>>> - if (!mrb->num_writes) {
>>> + for (i = start + 1; i < start + num_reqs; i++) {
>>> + qemu_iovec_concat(qiov, &mrb->reqs[i]->qiov, 0,
>>> + mrb->reqs[i]->qiov.size);
>>> + mrb->reqs[i - 1]->mr_next = mrb->reqs[i];
>>> + nb_sectors += mrb->reqs[i]->qiov.size / BDRV_SECTOR_SIZE;
>>> + }
>>> + assert(nb_sectors == qiov->size / BDRV_SECTOR_SIZE);
>>> +
>>> + trace_virtio_blk_submit_multireq(mrb, start, num_reqs, sector_num,
>>> + nb_sectors, is_write);
>>> + block_acct_merge_done(blk_get_stats(blk),
>>> + is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ,
>>> + num_reqs - 1);
>>> + }
>>> +
>>> + if (is_write) {
>>> + blk_aio_writev(blk, sector_num, qiov, nb_sectors,
>>> + virtio_blk_rw_complete, mrb->reqs[start]);
>>> + } else {
>>> + blk_aio_readv(blk, sector_num, qiov, nb_sectors,
>>> + virtio_blk_rw_complete, mrb->reqs[start]);
>>> + }
>>> +}
>>> +
>>> +static int virtio_multireq_compare(const void *a, const void *b)
>>> +{
>>> + const VirtIOBlockReq *req1 = *(VirtIOBlockReq **)a,
>>> + *req2 = *(VirtIOBlockReq **)b;
>>> +
>>> + /*
>>> + * Note that we can't simply subtract sector_num1 from sector_num2
>>> + * here as that could overflow the return value.
>>> + */
>>> + if (req1->sector_num > req2->sector_num) {
>>> + return 1;
>>> + } else if (req1->sector_num < req2->sector_num) {
>>> + return -1;
>>> + } else {
>>> + return 0;
>>> + }
>>> +}
>>> +
>>> +void virtio_submit_multireq(BlockBackend *blk, MultiReqBuffer *mrb)
>>> +{
>>> + int i = 0, start = 0, num_reqs = 0, niov = 0, nb_sectors = 0,
>>> + max_xfer_len = 0;
>> Albeit not necessary, starting a new "int" line won't take any more space (and would look nicer to me, probably).
> okay, I don't mind.
>
>>> + int64_t sector_num = 0;
>>> +
>>> + if (mrb->num_reqs == 1) {
>>> + virtio_submit_multireq2(blk, mrb, 0, 1, -1);
>>> + mrb->num_reqs = 0;
>>> return;
>>> }
>>> - ret = blk_aio_multiwrite(blk, mrb->blkreq, mrb->num_writes);
>>> - if (ret != 0) {
>>> - for (i = 0; i < mrb->num_writes; i++) {
>>> - if (mrb->blkreq[i].error) {
>>> - virtio_blk_rw_complete(mrb->blkreq[i].opaque, -EIO);
>>> + max_xfer_len = blk_get_max_transfer_length(mrb->reqs[0]->dev->blk);
>>> + max_xfer_len = MIN_NON_ZERO(max_xfer_len, INT_MAX);
>>> +
>>> + qsort(mrb->reqs, mrb->num_reqs, sizeof(*mrb->reqs),
>>> + &virtio_multireq_compare);
>>> +
>>> + for (i = 0; i < mrb->num_reqs; i++) {
>>> + VirtIOBlockReq *req = mrb->reqs[i];
>>> + if (num_reqs > 0) {
>>> + bool merge = true;
>>> +
>>> + /* merge would exceed maximum number of IOVs */
>>> + if (niov + req->qiov.niov + 1 > IOV_MAX) {
>> Hm, why the +1?
> A really good question. I copied this piece from the old merge routine. It seems
> definitely wrong.
>
>>> + merge = false;
>>> + }
>>> +
>>> + /* merge would exceed maximum transfer length of backend device */
>>> + if (req->qiov.size / BDRV_SECTOR_SIZE + nb_sectors > max_xfer_len) {
>>> + merge = false;
>>> + }
>>> +
>>> + /* requests are not sequential */
>>> + if (sector_num + nb_sectors != req->sector_num) {
>>> + merge = false;
>>> + }
>>> +
>>> + if (!merge) {
>>> + virtio_submit_multireq2(blk, mrb, start, num_reqs, niov);
>>> + num_reqs = 0;
>>> }
>>> }
>>> +
>>> + if (num_reqs == 0) {
>>> + sector_num = req->sector_num;
>>> + nb_sectors = niov = 0;
>>> + start = i;
>>> + }
>>> +
>>> + nb_sectors += req->qiov.size / BDRV_SECTOR_SIZE;
>>> + niov += req->qiov.niov;
>>> + num_reqs++;
>>> }
>>> - mrb->num_writes = 0;
>>> + virtio_submit_multireq2(blk, mrb, start, num_reqs, niov);
>>> + mrb->num_reqs = 0;
>>> }
>>> static void virtio_blk_handle_flush(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>>> @@ -319,7 +430,9 @@ static void virtio_blk_handle_flush(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>>> /*
>>> * Make sure all outstanding writes are posted to the backing device.
>>> */
>>> - virtio_submit_multiwrite(req->dev->blk, mrb);
>>> + if (mrb->is_write && mrb->num_reqs > 0) {
>>> + virtio_submit_multireq(req->dev->blk, mrb);
>>> + }
>>> blk_aio_flush(req->dev->blk, virtio_blk_flush_complete, req);
>>> }
>>> @@ -329,6 +442,9 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev,
>>> uint64_t nb_sectors = size >> BDRV_SECTOR_BITS;
>>> uint64_t total_sectors;
>>> + if (nb_sectors > INT_MAX) {
>>> + return false;
>>> + }
>>> if (sector & dev->sector_mask) {
>>> return false;
>>> }
>>> @@ -342,60 +458,6 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev,
>>> return true;
>>> }
>>> -static void virtio_blk_handle_write(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>>> -{
>>> - BlockRequest *blkreq;
>>> - uint64_t sector;
>>> -
>>> - sector = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector);
>>> -
>>> - trace_virtio_blk_handle_write(req, sector, req->qiov.size / 512);
>>> -
>>> - if (!virtio_blk_sect_range_ok(req->dev, sector, req->qiov.size)) {
>>> - virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
>>> - virtio_blk_free_request(req);
>>> - return;
>>> - }
>>> -
>>> - block_acct_start(blk_get_stats(req->dev->blk), &req->acct, req->qiov.size,
>>> - BLOCK_ACCT_WRITE);
>>> -
>>> - if (mrb->num_writes == VIRTIO_BLK_MAX_MERGE_REQS) {
>>> - virtio_submit_multiwrite(req->dev->blk, mrb);
>>> - }
>>> -
>>> - blkreq = &mrb->blkreq[mrb->num_writes];
>>> - blkreq->sector = sector;
>>> - blkreq->nb_sectors = req->qiov.size / BDRV_SECTOR_SIZE;
>>> - blkreq->qiov = &req->qiov;
>>> - blkreq->cb = virtio_blk_rw_complete;
>>> - blkreq->opaque = req;
>>> - blkreq->error = 0;
>>> -
>>> - mrb->num_writes++;
>>> -}
>>> -
>>> -static void virtio_blk_handle_read(VirtIOBlockReq *req)
>>> -{
>>> - uint64_t sector;
>>> -
>>> - sector = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector);
>>> -
>>> - trace_virtio_blk_handle_read(req, sector, req->qiov.size / 512);
>>> -
>>> - if (!virtio_blk_sect_range_ok(req->dev, sector, req->qiov.size)) {
>>> - virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
>>> - virtio_blk_free_request(req);
>>> - return;
>>> - }
>>> -
>>> - block_acct_start(blk_get_stats(req->dev->blk), &req->acct, req->qiov.size,
>>> - BLOCK_ACCT_READ);
>>> - blk_aio_readv(req->dev->blk, sector, &req->qiov,
>>> - req->qiov.size / BDRV_SECTOR_SIZE,
>>> - virtio_blk_rw_complete, req);
>>> -}
>>> -
>>> void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>>> {
>>> uint32_t type;
>>> @@ -430,11 +492,54 @@ void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>>> type = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
>>> - if (type & VIRTIO_BLK_T_FLUSH) {
>>> + switch (type & 0xff) {
>> Somehow I'd prefer type & ~VIRTIO_BLK_T_BARRIER, but this is correct, too.
> The problem is that type encodes operations and flags, and there is no official mask
> defined in the specs to distinguish them.
Well, VIRTIO_BLK_T_BARRIER is part of the legacy interface, and the
non-legacy interface does not have any flags whatsoever (well, one could
see VIRTIO_BLK_T_OUT as a flag, but it's not defined as such).
> If a new flag is ever defined it will break the switch statement.
No, it won't. If a new flag is added, support for it will be negotiated
through the feature bits; unless qemu supports that flag, the driver
won't set it.
(And if someone wants to implement that new flag, he/she will definitely
have to work on the switch statement anyway)
> I could add a macro
> for the 0xff in the headers if you like?
If you want to keep 0xff as the bit mask, I'd rather not use a macro;
using a macro may give readers of the code the impression that 0xff is
*not* arbitrary, whereas in reality it is.
On the other hand, I would not oppose using a macro if you were to use
~VIRTIO_BLK_T_BARRIER (or maybe even ~(VIRTIO_BLK_T_BARRIER |
VIRTIO_BLK_T_OUT), with VIRTIO_BLK_T_OUT being some kind of a flag, too)
because that value is not arbitrary.
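For illustration, the dispatch would then read roughly like this (only a sketch of the alternative, the patch as posted is fine too):
    /* VIRTIO_BLK_T_BARRIER is the only legacy bit we want to ignore here */
    switch (type & ~VIRTIO_BLK_T_BARRIER) {
    case VIRTIO_BLK_T_IN:
    case VIRTIO_BLK_T_OUT:
        /* read/write handling as in the patch */
        break;
    case VIRTIO_BLK_T_FLUSH:
        virtio_blk_handle_flush(req, mrb);
        break;
    case VIRTIO_BLK_T_SCSI_CMD:
        virtio_blk_handle_scsi(req);
        break;
    default:
        virtio_blk_req_complete(req, VIRTIO_BLK_S_UNSUPP);
        virtio_blk_free_request(req);
    }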
>
>>> + case VIRTIO_BLK_T_IN:
>>> + case VIRTIO_BLK_T_OUT:
>>> + {
>>> + bool is_write = type & VIRTIO_BLK_T_OUT;
>>> + req->sector_num = virtio_ldq_p(VIRTIO_DEVICE(req->dev),
>>> + &req->out.sector);
>>> +
>>> + if (!virtio_blk_sect_range_ok(req->dev, req->sector_num,
>>> + req->qiov.size)) {
>>> + virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
>>> + virtio_blk_free_request(req);
>>> + return;
>>> + }
>>> +
>>> + if (is_write) {
>>> + qemu_iovec_init_external(&req->qiov, iov, out_num);
>>> + trace_virtio_blk_handle_write(req, req->sector_num,
>>> + req->qiov.size / BDRV_SECTOR_SIZE);
>>> + } else {
>>> + qemu_iovec_init_external(&req->qiov, in_iov, in_num);
>>> + trace_virtio_blk_handle_read(req, req->sector_num,
>>> + req->qiov.size / BDRV_SECTOR_SIZE);
>>> + }
>> Before this patch, req->qiov (and subsequently req->qiov.size) was initialized before virtio_blk_sect_range_ok() was called (with req->qiov.size as an argument). Is it fine to change that?
> Oops, that's definitely a mistake. Half of the checks in virtio_blk_sect_range_ok won't work when the size isn't set yet.
>
>>> +
>>> + block_acct_start(blk_get_stats(req->dev->blk),
>>> + &req->acct, req->qiov.size,
>>> + is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
>>> +
>>> + /* merge would exceed maximum number of requests or IO direction
>>> + * changes */
>>> + if (mrb->num_reqs > 0 && (mrb->num_reqs == VIRTIO_BLK_MAX_MERGE_REQS ||
>>> + is_write != mrb->is_write)) {
>>> + virtio_submit_multireq(req->dev->blk, mrb);
>>> + }
>>> +
>>> + mrb->reqs[mrb->num_reqs++] = req;
>> Somehow I'd like an assert(mrb->num_reqs < VIRTIO_BLK_MAX_MERGE_REQS) before this. Feel free to ignore my request.
> This can't happen
Right. If it could, I would not propose an assert() but a real error
message. assert()s are for cases that "will never happen".
> and the check for this is right above.
Assuming virtio_submit_multireq() actually reduces mrb->num_reqs. Of
course it will now, but if someone accidentally breaks it for some case...
> What scenario do you think of that could overflow num_reqs?
With this patch alone: Nothing. But I can imagine someone breaking
virtio_submit_multireq()'s contract ("reduce mrb->num_reqs to 0").
My point is that it'll be really bad if this array access overflows for
some really stupid reason. It's up to you whether you want to make the
additional check (which I don't think will cost any measurable
performance) or not.
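Concretely, it would just be a one-line guard right before the array access (sketch):
    /* virtio_submit_multireq() above must have reset mrb->num_reqs to 0 */
    assert(mrb->num_reqs < VIRTIO_BLK_MAX_MERGE_REQS);
    mrb->reqs[mrb->num_reqs++] = req;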
>>> + mrb->is_write = is_write;
>>> + break;
>>> + }
>>> + case VIRTIO_BLK_T_FLUSH:
>> I think VIRTIO_BLK_T_FLUSH_OUT is missing (that's the reason it was "type & VIRTIO_BLK_T_..." before instead of "type == VIRTIO_BLK_T_...").
> VIRTIO_BLK_T_FLUSH_OUT is deprecated and not defined in the headers of qemu anymore. I can add it, but then I would also add VIRTIO_BLK_T_SCSI_CMD_OUT.
Hm, okay. In that case, I might once again propose using type &
~(VIRTIO_BLK_T_BARRIER | VIRTIO_BLK_T_OUT) or type & 0xfe (or type &
0xff & ~VIRTIO_BLK_T_OUT).
>>> virtio_blk_handle_flush(req, mrb);
>>> - } else if (type & VIRTIO_BLK_T_SCSI_CMD) {
>>> + break;
>>> + case VIRTIO_BLK_T_SCSI_CMD:
>> And VIRTIO_BLK_T_SCSI_CMD_OUT here.
> See above. I will define them in the headers.
>
>>
>> The overall logic of this patch seems good to me (although I had some trouble following which QIOVs had which purpose, which were allocated and which were not, and so on...).
> If you tell me at which point it was most difficult I can try to add comments.
Well, virtio_submit_multireq2() first stored the qiov->iov pointer and
then overwrote that (qiov->iov) through qemu_iovec_init(). I think a
comment about this being because the I/O vectors are external (and thus
cannot be changed) might have helped.
Then I was wondering whether you'd leak _iov until it appeared to me
that this wasn't the case because _iov comes from the guest or just
generally from somewhere outside of this file.
And finally I was wondering about the if (req->qiov.nalloc != -1) {
qemu_iovec_destroy(&req->qiov); } in virtio_blk_rw_complete(), until I
figured out that most of the I/O vectors are external vectors (indicated
by qiov.nalloc == -1) and thus are not to be freed here; and there is
only one case where qiov.nalloc != -1 and that is the QIOV which has
been created to contain all merged requests; but this was mostly due to
me not having had too much contact with the allocation and distribution
of I/O vectors yet.
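With comments, the piece that confused me would read roughly like this (sketch):
    /* Most qiovs are external (guest memory set up with
     * qemu_iovec_init_external(), marked by nalloc == -1) and must not be
     * freed here. Only the head request of a merged multireq owns a qiov
     * allocated with qemu_iovec_init(), and only that one is destroyed. */
    if (req->qiov.nalloc != -1) {
        qemu_iovec_destroy(&req->qiov);
    }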
Max
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH v3 4/4] virtio-blk: introduce multiread
2015-01-30 21:30 ` Max Reitz
@ 2015-01-31 19:45 ` Peter Lieven
0 siblings, 0 replies; 14+ messages in thread
From: Peter Lieven @ 2015-01-31 19:45 UTC (permalink / raw)
To: Max Reitz, qemu-devel
Cc: kwolf, famz, benoit, ming.lei, armbru, stefanha, pbonzini
Am 30.01.2015 um 22:30 schrieb Max Reitz:
> On 2015-01-30 at 16:05, Peter Lieven wrote:
>> Am 30.01.2015 um 18:16 schrieb Max Reitz:
>>> On 2015-01-30 at 09:33, Peter Lieven wrote:
>>>> this patch finally introduces multiread support to virtio-blk. While
>>>> multiwrite support was there for a long time, read support was missing.
>>>>
>>>> The complete merge logic is moved into virtio-blk.c which has
>>>> been the only user of request merging ever since. This is required
>>>> to be able to merge chunks of requests and immediately invoke callbacks
>>>> for those requests. Secondly, this is required to switch to
>>>> direct invocation of coroutines which is planned at a later stage.
>>>>
>>>> The following benchmarks show the performance of running fio with
>>>> 4 worker threads on a local ram disk. The numbers show the average
>>>> of 10 test runs after 1 run as warmup phase.
>>>>
>>>> | 4k | 64k | 4k
>>>> MB/s | rd seq | rd rand | rd seq | rd rand | wr seq | wr rand
>>>> --------------+--------+---------+--------+---------+--------+--------
>>>> master | 1221 | 1187 | 4178 | 4114 | 1745 | 1213
>>>> multiread | 1829 | 1189 | 4639 | 4110 | 1894 | 1216
>>>>
>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>> ---
>>>> hw/block/dataplane/virtio-blk.c | 8 +-
>>>> hw/block/virtio-blk.c | 288 +++++++++++++++++++++++++++-------------
>>>> include/hw/virtio/virtio-blk.h | 19 +--
>>>> trace-events | 1 +
>>>> 4 files changed, 210 insertions(+), 106 deletions(-)
>>>>
>>>> diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
>>>> index 39c5d71..93472e9 100644
>>>> --- a/hw/block/dataplane/virtio-blk.c
>>>> +++ b/hw/block/dataplane/virtio-blk.c
>>>> @@ -96,9 +96,7 @@ static void handle_notify(EventNotifier *e)
>>>> event_notifier_test_and_clear(&s->host_notifier);
>>>> blk_io_plug(s->conf->conf.blk);
>>>> for (;;) {
>>>> - MultiReqBuffer mrb = {
>>>> - .num_writes = 0,
>>>> - };
>>>> + MultiReqBuffer mrb = {};
>>>> int ret;
>>>> /* Disable guest->host notifies to avoid unnecessary vmexits */
>>>> @@ -120,7 +118,9 @@ static void handle_notify(EventNotifier *e)
>>>> virtio_blk_handle_request(req, &mrb);
>>>> }
>>>> - virtio_submit_multiwrite(s->conf->conf.blk, &mrb);
>>>> + if (mrb.num_reqs) {
>>>> + virtio_submit_multireq(s->conf->conf.blk, &mrb);
>>>> + }
>>>> if (likely(ret == -EAGAIN)) { /* vring emptied */
>>>> /* Re-enable guest->host notifies and stop processing the vring.
>>>> diff --git a/hw/block/virtio-blk.c b/hw/block/virtio-blk.c
>>>> index e04adb8..77890a0 100644
>>>> --- a/hw/block/virtio-blk.c
>>>> +++ b/hw/block/virtio-blk.c
>>>> @@ -34,6 +34,7 @@ VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
>>>> req->dev = s;
>>>> req->qiov.size = 0;
>>>> req->next = NULL;
>>>> + req->mr_next = NULL;
>>>> return req;
>>>> }
>>>> @@ -84,20 +85,29 @@ static int virtio_blk_handle_rw_error(VirtIOBlockReq *req, int error,
>>>> static void virtio_blk_rw_complete(void *opaque, int ret)
>>>> {
>>>> - VirtIOBlockReq *req = opaque;
>>>> + VirtIOBlockReq *next = opaque;
>>>> - trace_virtio_blk_rw_complete(req, ret);
>>>> + while (next) {
>>>> + VirtIOBlockReq *req = next;
>>>> + next = req->mr_next;
>>>> + trace_virtio_blk_rw_complete(req, ret);
>>>> - if (ret) {
>>>> - int p = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
>>>> - bool is_read = !(p & VIRTIO_BLK_T_OUT);
>>>> - if (virtio_blk_handle_rw_error(req, -ret, is_read))
>>>> - return;
>>>> - }
>>>> + if (req->qiov.nalloc != -1) {
>>>> + qemu_iovec_destroy(&req->qiov);
>>>> + }
>>>> - virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
>>>> - block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
>>>> - virtio_blk_free_request(req);
>>>> + if (ret) {
>>>> + int p = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
>>>> + bool is_read = !(p & VIRTIO_BLK_T_OUT);
>>>> + if (virtio_blk_handle_rw_error(req, -ret, is_read)) {
>>>> + continue;
>>>> + }
>>>> + }
>>>> +
>>>> + virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
>>>> + block_acct_done(blk_get_stats(req->dev->blk), &req->acct);
>>>> + virtio_blk_free_request(req);
>>>> + }
>>>> }
>>>> static void virtio_blk_flush_complete(void *opaque, int ret)
>>>> @@ -291,24 +301,125 @@ static void virtio_blk_handle_scsi(VirtIOBlockReq *req)
>>>> }
>>>> }
>>>> -void virtio_submit_multiwrite(BlockBackend *blk, MultiReqBuffer *mrb)
>>>> +static inline void
>>> Is the "inline" really worth it? This function is rather long...
>> I tried really hard not to add a regression for random reads, and the difference
>> between good and bad here is sometimes just a matter of a few small changes.
>> If gcc inlines on its own anyway, would you mind keeping the inline? The function
>> is actually only called from 2 places.
>
> Okay, feel free to keep it.
>
>>> (To my surprise, gcc actually does inline the function)
>>>
>>>> +virtio_submit_multireq2(BlockBackend *blk, MultiReqBuffer *mrb,
>>> Hm, that's not a very descriptive name... Maybe "submit_merged_requests" or something like that (it's static, so you can drop the virtio_ prefix if you want to) would be better?
>> Sure, you are right.
>>
>>>> + int start, int num_reqs, int niov)
>>>> {
>>>> - int i, ret;
>>>> + QEMUIOVector *qiov = &mrb->reqs[start]->qiov;
>>>> + int64_t sector_num = mrb->reqs[start]->sector_num;
>>>> + int nb_sectors = mrb->reqs[start]->qiov.size / BDRV_SECTOR_SIZE;
>>>> + bool is_write = mrb->is_write;
>>>> +
>>>> + if (num_reqs > 1) {
>>>> + int i;
>>>> + struct iovec *_iov = qiov->iov;
>>>> + int _niov = qiov->niov;
>>> Identifiers with leading underscores are considered reserved, I'd rather avoid using them.
>> Okay. I will use tmp_iov and tmp_niov instead.
>>
>>>> +
>>>> + qemu_iovec_init(qiov, niov);
>>>> +
>>>> + for (i = 0; i < _niov; i++) {
>>>> + qemu_iovec_add(qiov, _iov[i].iov_base, _iov[i].iov_len);
>>>> + }
>>>> - if (!mrb->num_writes) {
>>>> + for (i = start + 1; i < start + num_reqs; i++) {
>>>> + qemu_iovec_concat(qiov, &mrb->reqs[i]->qiov, 0,
>>>> + mrb->reqs[i]->qiov.size);
>>>> + mrb->reqs[i - 1]->mr_next = mrb->reqs[i];
>>>> + nb_sectors += mrb->reqs[i]->qiov.size / BDRV_SECTOR_SIZE;
>>>> + }
>>>> + assert(nb_sectors == qiov->size / BDRV_SECTOR_SIZE);
>>>> +
>>>> + trace_virtio_blk_submit_multireq(mrb, start, num_reqs, sector_num,
>>>> + nb_sectors, is_write);
>>>> + block_acct_merge_done(blk_get_stats(blk),
>>>> + is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ,
>>>> + num_reqs - 1);
>>>> + }
>>>> +
>>>> + if (is_write) {
>>>> + blk_aio_writev(blk, sector_num, qiov, nb_sectors,
>>>> + virtio_blk_rw_complete, mrb->reqs[start]);
>>>> + } else {
>>>> + blk_aio_readv(blk, sector_num, qiov, nb_sectors,
>>>> + virtio_blk_rw_complete, mrb->reqs[start]);
>>>> + }
>>>> +}
>>>> +
>>>> +static int virtio_multireq_compare(const void *a, const void *b)
>>>> +{
>>>> + const VirtIOBlockReq *req1 = *(VirtIOBlockReq **)a,
>>>> + *req2 = *(VirtIOBlockReq **)b;
>>>> +
>>>> + /*
>>>> + * Note that we can't simply subtract sector_num1 from sector_num2
>>>> + * here as that could overflow the return value.
>>>> + */
>>>> + if (req1->sector_num > req2->sector_num) {
>>>> + return 1;
>>>> + } else if (req1->sector_num < req2->sector_num) {
>>>> + return -1;
>>>> + } else {
>>>> + return 0;
>>>> + }
>>>> +}
>>>> +
>>>> +void virtio_submit_multireq(BlockBackend *blk, MultiReqBuffer *mrb)
>>>> +{
>>>> + int i = 0, start = 0, num_reqs = 0, niov = 0, nb_sectors = 0,
>>>> + max_xfer_len = 0;
>>> Albeit not necessary, starting a new "int" line won't take any more space (and would look nicer to me, probably).
>> okay, I don't mind.
>>
>>>> + int64_t sector_num = 0;
>>>> +
>>>> + if (mrb->num_reqs == 1) {
>>>> + virtio_submit_multireq2(blk, mrb, 0, 1, -1);
>>>> + mrb->num_reqs = 0;
>>>> return;
>>>> }
>>>> - ret = blk_aio_multiwrite(blk, mrb->blkreq, mrb->num_writes);
>>>> - if (ret != 0) {
>>>> - for (i = 0; i < mrb->num_writes; i++) {
>>>> - if (mrb->blkreq[i].error) {
>>>> - virtio_blk_rw_complete(mrb->blkreq[i].opaque, -EIO);
>>>> + max_xfer_len = blk_get_max_transfer_length(mrb->reqs[0]->dev->blk);
>>>> + max_xfer_len = MIN_NON_ZERO(max_xfer_len, INT_MAX);
>>>> +
>>>> + qsort(mrb->reqs, mrb->num_reqs, sizeof(*mrb->reqs),
>>>> + &virtio_multireq_compare);
>>>> +
>>>> + for (i = 0; i < mrb->num_reqs; i++) {
>>>> + VirtIOBlockReq *req = mrb->reqs[i];
>>>> + if (num_reqs > 0) {
>>>> + bool merge = true;
>>>> +
>>>> + /* merge would exceed maximum number of IOVs */
>>>> + if (niov + req->qiov.niov + 1 > IOV_MAX) {
>>> Hm, why the +1?
>> A really good question. I copied this piece from the old merge routine. It seems
>> definitely wrong.
>>
>>>> + merge = false;
>>>> + }
>>>> +
>>>> + /* merge would exceed maximum transfer length of backend device */
>>>> + if (req->qiov.size / BDRV_SECTOR_SIZE + nb_sectors > max_xfer_len) {
>>>> + merge = false;
>>>> + }
>>>> +
>>>> + /* requests are not sequential */
>>>> + if (sector_num + nb_sectors != req->sector_num) {
>>>> + merge = false;
>>>> + }
>>>> +
>>>> + if (!merge) {
>>>> + virtio_submit_multireq2(blk, mrb, start, num_reqs, niov);
>>>> + num_reqs = 0;
>>>> }
>>>> }
>>>> +
>>>> + if (num_reqs == 0) {
>>>> + sector_num = req->sector_num;
>>>> + nb_sectors = niov = 0;
>>>> + start = i;
>>>> + }
>>>> +
>>>> + nb_sectors += req->qiov.size / BDRV_SECTOR_SIZE;
>>>> + niov += req->qiov.niov;
>>>> + num_reqs++;
>>>> }
>>>> - mrb->num_writes = 0;
>>>> + virtio_submit_multireq2(blk, mrb, start, num_reqs, niov);
>>>> + mrb->num_reqs = 0;
>>>> }
>>>> static void virtio_blk_handle_flush(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>>>> @@ -319,7 +430,9 @@ static void virtio_blk_handle_flush(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>>>> /*
>>>> * Make sure all outstanding writes are posted to the backing device.
>>>> */
>>>> - virtio_submit_multiwrite(req->dev->blk, mrb);
>>>> + if (mrb->is_write && mrb->num_reqs > 0) {
>>>> + virtio_submit_multireq(req->dev->blk, mrb);
>>>> + }
>>>> blk_aio_flush(req->dev->blk, virtio_blk_flush_complete, req);
>>>> }
>>>> @@ -329,6 +442,9 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev,
>>>> uint64_t nb_sectors = size >> BDRV_SECTOR_BITS;
>>>> uint64_t total_sectors;
>>>> + if (nb_sectors > INT_MAX) {
>>>> + return false;
>>>> + }
>>>> if (sector & dev->sector_mask) {
>>>> return false;
>>>> }
>>>> @@ -342,60 +458,6 @@ static bool virtio_blk_sect_range_ok(VirtIOBlock *dev,
>>>> return true;
>>>> }
>>>> -static void virtio_blk_handle_write(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>>>> -{
>>>> - BlockRequest *blkreq;
>>>> - uint64_t sector;
>>>> -
>>>> - sector = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector);
>>>> -
>>>> - trace_virtio_blk_handle_write(req, sector, req->qiov.size / 512);
>>>> -
>>>> - if (!virtio_blk_sect_range_ok(req->dev, sector, req->qiov.size)) {
>>>> - virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
>>>> - virtio_blk_free_request(req);
>>>> - return;
>>>> - }
>>>> -
>>>> - block_acct_start(blk_get_stats(req->dev->blk), &req->acct, req->qiov.size,
>>>> - BLOCK_ACCT_WRITE);
>>>> -
>>>> - if (mrb->num_writes == VIRTIO_BLK_MAX_MERGE_REQS) {
>>>> - virtio_submit_multiwrite(req->dev->blk, mrb);
>>>> - }
>>>> -
>>>> - blkreq = &mrb->blkreq[mrb->num_writes];
>>>> - blkreq->sector = sector;
>>>> - blkreq->nb_sectors = req->qiov.size / BDRV_SECTOR_SIZE;
>>>> - blkreq->qiov = &req->qiov;
>>>> - blkreq->cb = virtio_blk_rw_complete;
>>>> - blkreq->opaque = req;
>>>> - blkreq->error = 0;
>>>> -
>>>> - mrb->num_writes++;
>>>> -}
>>>> -
>>>> -static void virtio_blk_handle_read(VirtIOBlockReq *req)
>>>> -{
>>>> - uint64_t sector;
>>>> -
>>>> - sector = virtio_ldq_p(VIRTIO_DEVICE(req->dev), &req->out.sector);
>>>> -
>>>> - trace_virtio_blk_handle_read(req, sector, req->qiov.size / 512);
>>>> -
>>>> - if (!virtio_blk_sect_range_ok(req->dev, sector, req->qiov.size)) {
>>>> - virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
>>>> - virtio_blk_free_request(req);
>>>> - return;
>>>> - }
>>>> -
>>>> - block_acct_start(blk_get_stats(req->dev->blk), &req->acct, req->qiov.size,
>>>> - BLOCK_ACCT_READ);
>>>> - blk_aio_readv(req->dev->blk, sector, &req->qiov,
>>>> - req->qiov.size / BDRV_SECTOR_SIZE,
>>>> - virtio_blk_rw_complete, req);
>>>> -}
>>>> -
>>>> void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>>>> {
>>>> uint32_t type;
>>>> @@ -430,11 +492,54 @@ void virtio_blk_handle_request(VirtIOBlockReq *req, MultiReqBuffer *mrb)
>>>> type = virtio_ldl_p(VIRTIO_DEVICE(req->dev), &req->out.type);
>>>> - if (type & VIRTIO_BLK_T_FLUSH) {
>>>> + switch (type & 0xff) {
>>> Somehow I'd prefer type & ~VIRTIO_BLK_T_BARRIER, but this is correct, too.
>> The problem is that type encodes operations and flags, and there is no official mask
>> defined in the specs to distinguish them.
>
> Well, VIRTIO_BLK_T_BARRIER is part of the legacy interface, and the non-legacy interface does not have any flags whatsoever (well, one could see VIRTIO_BLK_T_OUT as a flag, but it's not defined as such).
>
>> If a new flag is ever defined it will break the switch statement.
>
> No, it won't. If a new flag is added, support for it will be negotiated through the feature bits; unless qemu supports that flag, the driver won't set it.
>
> (And if someone wants to implement that new flag, he/she will definitely have to work on the switch statement anyway)
>
>> I could add a macro
>> for the 0xff in the headers if you like?
>
> If you want to keep 0xff as the bit mask, I'd rather not use a macro; using a macro may give readers of the code the impression that 0xff is *not* arbitrary, whereas in reality it is.
>
> On the other hand, I would not oppose using a macro if you were to use ~VIRTIO_BLK_T_BARRIER (or maybe even ~(VIRTIO_BLK_T_BARRIER | VIRTIO_BLK_T_OUT), with VIRTIO_BLK_T_OUT being some kind of a flag, too) because that value is not arbitrary.
>
>>
>>>> + case VIRTIO_BLK_T_IN:
>>>> + case VIRTIO_BLK_T_OUT:
>>>> + {
>>>> + bool is_write = type & VIRTIO_BLK_T_OUT;
>>>> + req->sector_num = virtio_ldq_p(VIRTIO_DEVICE(req->dev),
>>>> + &req->out.sector);
>>>> +
>>>> + if (!virtio_blk_sect_range_ok(req->dev, req->sector_num,
>>>> + req->qiov.size)) {
>>>> + virtio_blk_req_complete(req, VIRTIO_BLK_S_IOERR);
>>>> + virtio_blk_free_request(req);
>>>> + return;
>>>> + }
>>>> +
>>>> + if (is_write) {
>>>> + qemu_iovec_init_external(&req->qiov, iov, out_num);
>>>> + trace_virtio_blk_handle_write(req, req->sector_num,
>>>> + req->qiov.size / BDRV_SECTOR_SIZE);
>>>> + } else {
>>>> + qemu_iovec_init_external(&req->qiov, in_iov, in_num);
>>>> + trace_virtio_blk_handle_read(req, req->sector_num,
>>>> + req->qiov.size / BDRV_SECTOR_SIZE);
>>>> + }
>>> Before this patch, req->qiov (and subsequently req->qiov.size) was initialized before virtio_blk_sect_range_ok() was called (with req->qiov.size as an argument). Is it fine to change that?
>> Oops, that's definitely a mistake. Half of the checks in virtio_blk_sect_range_ok won't work when the size isn't set yet.
>>
>>>> +
>>>> + block_acct_start(blk_get_stats(req->dev->blk),
>>>> + &req->acct, req->qiov.size,
>>>> + is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
>>>> +
>>>> + /* merge would exceed maximum number of requests or IO direction
>>>> + * changes */
>>>> + if (mrb->num_reqs > 0 && (mrb->num_reqs == VIRTIO_BLK_MAX_MERGE_REQS ||
>>>> + is_write != mrb->is_write)) {
>>>> + virtio_submit_multireq(req->dev->blk, mrb);
>>>> + }
>>>> +
>>>> + mrb->reqs[mrb->num_reqs++] = req;
>>> Somehow I'd like an assert(mrb->num_reqs < VIRTIO_BLK_MAX_MERGE_REQS) before this. Feel free to ignore my request.
>> This can't happen
>
> Right. If it could, I would not propose an assert() but a real error message. assert()s are for cases that "will never happen".
>
>> and the check for this is right above.
>
> Assuming virtio_submit_multireq() actually reduces mrb->num_reqs. Of course it will now, but if someone accidentally breaks it for some case...
>
>> What scenario do you think of that could overflow num_reqs?
>
> With this patch alone: Nothing. But I can imagine someone breaking virtio_submit_multireq()'s contract ("reduce mrb->num_reqs to 0").
>
> My point is that it'll be really bad if this array access overflows for some really stupid reason. It's up to you whether you want to make the additional check (which I don't think will cost any measurable performance) or not.
I will add an assertion.
>
>>>> + mrb->is_write = is_write;
>>>> + break;
>>>> + }
>>>> + case VIRTIO_BLK_T_FLUSH:
>>> I think VIRTIO_BLK_T_FLUSH_OUT is missing (that's the reason it was "type & VIRTIO_BLK_T_..." before instead of "type == VIRTIO_BLK_T_...").
>> VIRTIO_BLK_T_FLUSH_OUT is deprecated and not defined in the headers of qemu anymore. I can add it, but then I would also add VIRTIO_BLK_T_SCSI_CMD_OUT.
>
> Hm, okay. In that case, I might once again propose using type & ~(VIRTIO_BLK_T_BARRIER | VIRTIO_BLK_T_OUT) or type & 0xfe (or type & 0xff & ~VIRTIO_BLK_T_OUT).
If VIRTIO_BLK_T_OUT is not an official flag, I would not mask it out and would add VIRTIO_BLK_T_FLUSH_OUT and VIRTIO_BLK_T_SCSI_OUT instead.
In theory I would then not have to mask type at all because we do not support barriers and in theory no guest should send it. But since we ignored
it before, we should keep ignoring it in the future so as not to break support for broken guests.
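Roughly, with the legacy constants back in the header the cases would look like this (sketch; exact constant names to be taken from the spec/headers):
    case VIRTIO_BLK_T_FLUSH:
    case VIRTIO_BLK_T_FLUSH_OUT:
        virtio_blk_handle_flush(req, mrb);
        break;
    case VIRTIO_BLK_T_SCSI_CMD:
    case VIRTIO_BLK_T_SCSI_CMD_OUT:
        virtio_blk_handle_scsi(req);
        break;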
>
>>>> virtio_blk_handle_flush(req, mrb);
>>>> - } else if (type & VIRTIO_BLK_T_SCSI_CMD) {
>>>> + break;
>>>> + case VIRTIO_BLK_T_SCSI_CMD:
>>> And VIRTIO_BLK_T_SCSI_CMD_OUT here.
>> See above. I will define them in the headers.
>>
>>>
>>> The overall logic of this patch seems good to me (although I had some trouble following which QIOVs had which purpose, which were allocated and which were not, and so on...).
>> If you tell me at which point it was most difficult I can try to add comments.
>
> Well, virtio_submit_multireq2() first stored the qiov->iov pointer and then overwrote that (qiov->iov) through qemu_iovec_init(). I think a comment about this being because the I/O vectors are external (and thus cannot be changed) might have helped.
>
> Then I was wondering whether you'd leak _iov until it appeared to me that this wasn't the case because _iov comes from the guest or just generally from somewhere outside of this file.
>
> And finally I was wondering about the if (req->qiov.nalloc != -1) { qemu_iovec_destroy(&req->qiov); } in virtio_blk_rw_complete(), until I figured out that most of the I/O vectors are external vectors (indicated by qiov.nalloc == -1) and thus are not to be freed here; and there is only one case where qiov.nalloc != -1 and that is the QIOV which has been created to contain all merged requests; but this was mostly due to me not having had too much contact with the allocation and distribution of I/O vectors yet.
I will add comments at these locations.
Peter
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH v3 4/4] virtio-blk: introduce multiread
2015-01-30 21:15 ` Kevin Wolf
@ 2015-01-31 19:46 ` Peter Lieven
0 siblings, 0 replies; 14+ messages in thread
From: Peter Lieven @ 2015-01-31 19:46 UTC (permalink / raw)
To: Kevin Wolf
Cc: famz, benoit, ming.lei, armbru, qemu-devel, stefanha, pbonzini,
Max Reitz
Am 30.01.2015 um 22:15 schrieb Kevin Wolf:
> Am 30.01.2015 um 22:05 hat Peter Lieven geschrieben:
>> Am 30.01.2015 um 18:16 schrieb Max Reitz:
>>> On 2015-01-30 at 09:33, Peter Lieven wrote:
>>>> this patch finally introduces multiread support to virtio-blk. While
>>>> multiwrite support was there for a long time, read support was missing.
>>>>
>>>> The complete merge logic is moved into virtio-blk.c which has
>>>> been the only user of request merging ever since. This is required
>>>> to be able to merge chunks of requests and immediately invoke callbacks
>>>> for those requests. Secondly, this is required to switch to
>>>> direct invocation of coroutines which is planned at a later stage.
>>>>
>>>> The following benchmarks show the performance of running fio with
>>>> 4 worker threads on a local ram disk. The numbers show the average
>>>> of 10 test runs after 1 run as warmup phase.
>>>>
>>>> | 4k | 64k | 4k
>>>> MB/s | rd seq | rd rand | rd seq | rd rand | wr seq | wr rand
>>>> --------------+--------+---------+--------+---------+--------+--------
>>>> master | 1221 | 1187 | 4178 | 4114 | 1745 | 1213
>>>> multiread | 1829 | 1189 | 4639 | 4110 | 1894 | 1216
>>>>
>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>> ---
>>>> hw/block/dataplane/virtio-blk.c | 8 +-
>>>> hw/block/virtio-blk.c | 288 +++++++++++++++++++++++++++-------------
>>>> include/hw/virtio/virtio-blk.h | 19 +--
>>>> trace-events | 1 +
>>>> 4 files changed, 210 insertions(+), 106 deletions(-)
>>>> + int64_t sector_num = 0;
>>>> +
>>>> + if (mrb->num_reqs == 1) {
>>>> + virtio_submit_multireq2(blk, mrb, 0, 1, -1);
>>>> + mrb->num_reqs = 0;
>>>> return;
>>>> }
>>>> - ret = blk_aio_multiwrite(blk, mrb->blkreq, mrb->num_writes);
>>>> - if (ret != 0) {
>>>> - for (i = 0; i < mrb->num_writes; i++) {
>>>> - if (mrb->blkreq[i].error) {
>>>> - virtio_blk_rw_complete(mrb->blkreq[i].opaque, -EIO);
>>>> + max_xfer_len = blk_get_max_transfer_length(mrb->reqs[0]->dev->blk);
>>>> + max_xfer_len = MIN_NON_ZERO(max_xfer_len, INT_MAX);
>>>> +
>>>> + qsort(mrb->reqs, mrb->num_reqs, sizeof(*mrb->reqs),
>>>> + &virtio_multireq_compare);
>>>> +
>>>> + for (i = 0; i < mrb->num_reqs; i++) {
>>>> + VirtIOBlockReq *req = mrb->reqs[i];
>>>> + if (num_reqs > 0) {
>>>> + bool merge = true;
>>>> +
>>>> + /* merge would exceed maximum number of IOVs */
>>>> + if (niov + req->qiov.niov + 1 > IOV_MAX) {
>>> Hm, why the +1?
>> A really good question. I copied this piece from the old merge routine. It seems
>> definitely wrong.
> The old code merged requests even if they were overlapping. This could
> result in one area being split in two.
>
> I think you don't support this here, so removing the + 1 is probably
> okay.
I don't support it because it was a good source of bugs in the past, and I think
a good guest should not create overlapping requests at all.
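The sequentiality check in the sort loop already takes care of that; an overlapping request simply ends the current merge group (that condition again, for reference):
    /* requests must be strictly adjacent; overlapping or gapped requests
     * end the current merge group and are submitted separately */
    if (sector_num + nb_sectors != req->sector_num) {
        merge = false;
    }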
Peter
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads: [~2015-01-31 19:46 UTC | newest]
Thread overview: 14+ messages
2015-01-30 14:32 [Qemu-devel] [PATCH v3 0/4] virtio-blk: add multiread support Peter Lieven
2015-01-30 14:32 ` [Qemu-devel] [PATCH v3 1/4] block: add accounting for merged requests Peter Lieven
2015-01-30 15:53 ` Max Reitz
2015-01-30 14:33 ` [Qemu-devel] [PATCH v3 2/4] hw/virtio-blk: add a constant for max number of " Peter Lieven
2015-01-30 16:01 ` Max Reitz
2015-01-30 14:33 ` [Qemu-devel] [PATCH v3 3/4] block-backend: expose bs->bl.max_transfer_length Peter Lieven
2015-01-30 16:04 ` Max Reitz
2015-01-30 14:33 ` [Qemu-devel] [PATCH v3 4/4] virtio-blk: introduce multiread Peter Lieven
2015-01-30 17:16 ` Max Reitz
2015-01-30 21:05 ` Peter Lieven
2015-01-30 21:15 ` Kevin Wolf
2015-01-31 19:46 ` Peter Lieven
2015-01-30 21:30 ` Max Reitz
2015-01-31 19:45 ` Peter Lieven