qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [PATCH v5 0/4] The intro of QEMU block I/O throttling
@ 2011-08-09  4:17 Zhi Yong Wu
  2011-08-09  4:17 ` [Qemu-devel] [PATCH v5 1/4] block: add the command line support Zhi Yong Wu
                   ` (4 more replies)
  0 siblings, 5 replies; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-09  4:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, pair, stefanha, kvm, mtosatti, Zhi Yong Wu, zwu.kernel,
	ryanh, luowenj

The main goal of this patch series is to effectively cap the disk I/O speed or request count of a single VM. It is still a draft, so it unavoidably has some drawbacks; if you catch any, please let me know.

The series mainly introduces one block I/O throttling algorithm, one timer, and one block queue for each drive with I/O limits enabled.

When a block request comes in, the throttling algorithm checks whether its I/O rate or count exceeds the limits; if so, the request is enqueued on the block queue, and the timer later dispatches the queued requests.
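The admission decision described above can be sketched as a small stand-alone model. The type and function names, and the 100 ms slice length, are illustrative assumptions here, not the identifiers actually used in the patches:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the slice-based bps check: all names and the
 * slice length are assumptions for illustration, not QEMU's own. */
#define SLICE_NS 100000000LL   /* assumed 100 ms slice */

typedef struct {
    uint64_t bps_limit;        /* bytes per second allowed */
    uint64_t bytes_disp;       /* bytes already dispatched in this slice */
} ThrottleModel;

/* Return true if dispatching 'bytes' now would exceed the limit;
 * when it would, *wait_ns receives an approximate delay after which
 * the request could be retried (the timer's deadline). */
static bool would_exceed(ThrottleModel *t, uint64_t bytes,
                         double elapsed_s, uint64_t *wait_ns)
{
    double slice_s = SLICE_NS / 1e9;
    double quota = t->bps_limit * slice_s;   /* bytes allowed per slice */

    if (t->bytes_disp + bytes <= quota) {
        *wait_ns = 0;
        return false;                        /* dispatch immediately */
    }
    /* time by which the request is "paid for", minus time already spent */
    double wait_s = (double)(t->bytes_disp + bytes) / t->bps_limit
                    - elapsed_s;
    if (wait_s < 0) {
        wait_s = 0;
    }
    *wait_ns = (uint64_t)(wait_s * 1e9);
    return true;                             /* enqueue and arm the timer */
}
```

A request that fits inside the remaining per-slice quota is dispatched at once; anything else is deferred by roughly the time it would take the configured bandwidth to cover the bytes already dispatched plus the new request.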

The following features are available:
(1) global bps limit.
    -drive bps=xxx            in bytes/s
(2) only read bps limit
    -drive bps_rd=xxx         in bytes/s
(3) only write bps limit
    -drive bps_wr=xxx         in bytes/s
(4) global iops limit
    -drive iops=xxx           in ios/s
(5) only read iops limit
    -drive iops_rd=xxx        in ios/s
(6) only write iops limit
    -drive iops_wr=xxx        in ios/s
(7) a combination of compatible limits, e.g. a bandwidth cap plus an iops cap:
    -drive bps=xxx,iops=xxx

Known limitations:
(1) #1 cannot coexist with #2 or #3.
(2) #4 cannot coexist with #5 or #6.
(3) When a bps/iops limit is set to a very small value such as 511 bytes/s, the VM will hang. We are considering how to handle this scenario.
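A simplified model hints at why very small limits stall: assuming the 100 ms slice length (an assumption of this sketch, not stated by the series), a single 512-byte sector request is always larger than the per-slice byte quota at 511 bytes/s, so the admission check can never pass no matter how often the slice counters are reset:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model only; the names and the 100 ms slice length are
 * assumptions, not taken from the QEMU patches themselves. */
#define SLICE_S 0.1            /* assumed slice length in seconds */
#define SECTOR  512            /* bytes per sector */

/* Can a request of 'bytes' ever be admitted within one slice?
 * Dispatched bytes reset to 0 at each new slice, so comparing the
 * request against the full per-slice quota is the best case. */
static bool fits_in_slice(uint64_t bps_limit, uint64_t bytes)
{
    double quota = bps_limit * SLICE_S;   /* bytes allowed per slice */
    return bytes <= quota;
}
```

Under this model, any bps limit below SECTOR / SLICE_S (about 5120 bytes/s for a 100 ms slice) would leave a one-sector request queued forever, which is consistent with the reported hang.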

Zhi Yong Wu (4):
  v5: added QMP/HMP support.
      Adjusted the code based on Stefan's comments.
  block: add the command line support
  block: add the block queue support
  block: add block timer and block throttling algorithm
  qmp/hmp: add block_set_io_throttle

  v4: fixed a memory leak based on Ryan's feedback.

  v3: added code for extending the slice time, and changed how the wait time for the timer is computed.

  v2: modified the code mainly based on Stefan's comments.

  v1: submitted the initial code draft for QEMU disk I/O limits.

 Makefile.objs     |    2 +-
 block.c           |  347 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 block.h           |    6 +-
 block/blk-queue.c |  141 ++++++++++++++++++++++
 block/blk-queue.h |   73 +++++++++++
 block_int.h       |   30 +++++
 blockdev.c        |  108 +++++++++++++++++
 blockdev.h        |    2 +
 hmp-commands.hx   |   15 +++
 qemu-config.c     |   24 ++++
 qemu-option.c     |   17 +++
 qemu-option.h     |    1 +
 qemu-options.hx   |    1 +
 qerror.c          |    4 +
 qerror.h          |    3 +
 qmp-commands.hx   |   52 ++++++++-
 16 files changed, 813 insertions(+), 13 deletions(-)
 create mode 100644 block/blk-queue.c
 create mode 100644 block/blk-queue.h

-- 
1.7.2.3


* [Qemu-devel] [PATCH v5 1/4] block: add the command line support
  2011-08-09  4:17 [Qemu-devel] [PATCH v5 0/4] The intro of QEMU block I/O throttling Zhi Yong Wu
@ 2011-08-09  4:17 ` Zhi Yong Wu
  2011-08-09 12:25   ` Stefan Hajnoczi
  2011-08-09  4:17 ` [Qemu-devel] [PATCH v5 2/4] block: add the block queue support Zhi Yong Wu
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-09  4:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, pair, stefanha, kvm, mtosatti, Zhi Yong Wu, zwu.kernel,
	ryanh, luowenj

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 Makefile.objs   |    2 +-
 blockdev.c      |   39 +++++++++++++++++++++++++++++++++++++++
 qemu-config.c   |   24 ++++++++++++++++++++++++
 qemu-option.c   |   17 +++++++++++++++++
 qemu-option.h   |    1 +
 qemu-options.hx |    1 +
 6 files changed, 83 insertions(+), 1 deletions(-)

diff --git a/Makefile.objs b/Makefile.objs
index 9f99ed4..06f2033 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -23,7 +23,7 @@ block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vv
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o
 block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-nested-y += qed-check.o
-block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
+block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o blk-queue.o
 block-nested-$(CONFIG_WIN32) += raw-win32.o
 block-nested-$(CONFIG_POSIX) += raw-posix.o
 block-nested-$(CONFIG_CURL) += curl.o
diff --git a/blockdev.c b/blockdev.c
index c263663..9c78548 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -238,6 +238,10 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
     int on_read_error, on_write_error;
     const char *devaddr;
     DriveInfo *dinfo;
+    BlockIOLimit io_limits;
+    bool iol_flag = false;
+    const char *iol_opts[7] = {"bps", "bps_rd", "bps_wr",
+                                "iops", "iops_rd", "iops_wr"};
     int is_extboot = 0;
     int snapshot = 0;
     int ret;
@@ -372,6 +376,36 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
         return NULL;
     }
 
+    /* disk io throttling */
+    iol_flag = qemu_opt_io_limits_enable_flag(opts, iol_opts);
+    if (iol_flag) {
+        memset(&io_limits, 0, sizeof(BlockIOLimit));
+
+        io_limits.bps[BLOCK_IO_LIMIT_TOTAL]  =
+                           qemu_opt_get_number(opts, "bps", 0);
+        io_limits.bps[BLOCK_IO_LIMIT_READ]   =
+                           qemu_opt_get_number(opts, "bps_rd", 0);
+        io_limits.bps[BLOCK_IO_LIMIT_WRITE]  =
+                           qemu_opt_get_number(opts, "bps_wr", 0);
+        io_limits.iops[BLOCK_IO_LIMIT_TOTAL] =
+                           qemu_opt_get_number(opts, "iops", 0);
+        io_limits.iops[BLOCK_IO_LIMIT_READ]  =
+                           qemu_opt_get_number(opts, "iops_rd", 0);
+        io_limits.iops[BLOCK_IO_LIMIT_WRITE] =
+                           qemu_opt_get_number(opts, "iops_wr", 0);
+
+        if (((io_limits.bps[BLOCK_IO_LIMIT_TOTAL] != 0)
+            && ((io_limits.bps[BLOCK_IO_LIMIT_READ] != 0)
+            || (io_limits.bps[BLOCK_IO_LIMIT_WRITE] != 0)))
+            || ((io_limits.iops[BLOCK_IO_LIMIT_TOTAL] != 0)
+            && ((io_limits.iops[BLOCK_IO_LIMIT_READ] != 0)
+            || (io_limits.iops[BLOCK_IO_LIMIT_WRITE] != 0)))) {
+            error_report("bps(iops) and bps_rd/bps_wr(iops_rd/iops_wr) \
+                                        cannot be used at the same time");
+            return NULL;
+        }
+    }
+
     on_write_error = BLOCK_ERR_STOP_ENOSPC;
     if ((buf = qemu_opt_get(opts, "werror")) != NULL) {
         if (type != IF_IDE && type != IF_SCSI && type != IF_VIRTIO && type != IF_NONE) {
@@ -483,6 +517,11 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
 
     bdrv_set_on_error(dinfo->bdrv, on_read_error, on_write_error);
 
+    /* disk I/O throttling */
+    if (iol_flag) {
+        bdrv_set_io_limits(dinfo->bdrv, &io_limits);
+    }
+
     switch(type) {
     case IF_IDE:
     case IF_SCSI:
diff --git a/qemu-config.c b/qemu-config.c
index efa892c..9232bbb 100644
--- a/qemu-config.c
+++ b/qemu-config.c
@@ -82,6 +82,30 @@ static QemuOptsList qemu_drive_opts = {
             .name = "boot",
             .type = QEMU_OPT_BOOL,
             .help = "make this a boot drive",
+        },{
+            .name = "iops",
+            .type = QEMU_OPT_NUMBER,
+            .help = "limit total I/O operations per second",
+        },{
+            .name = "iops_rd",
+            .type = QEMU_OPT_NUMBER,
+            .help = "limit read operations per second",
+        },{
+            .name = "iops_wr",
+            .type = QEMU_OPT_NUMBER,
+            .help = "limit write operations per second",
+        },{
+            .name = "bps",
+            .type = QEMU_OPT_NUMBER,
+            .help = "limit total bytes per second",
+        },{
+            .name = "bps_rd",
+            .type = QEMU_OPT_NUMBER,
+            .help = "limit read bytes per second",
+        },{
+            .name = "bps_wr",
+            .type = QEMU_OPT_NUMBER,
+            .help = "limit write bytes per second",
         },
         { /* end of list */ }
     },
diff --git a/qemu-option.c b/qemu-option.c
index 65db542..c5aa96a 100644
--- a/qemu-option.c
+++ b/qemu-option.c
@@ -562,6 +562,23 @@ uint64_t qemu_opt_get_number(QemuOpts *opts, const char *name, uint64_t defval)
     return opt->value.uint;
 }
 
+bool qemu_opt_io_limits_enable_flag(QemuOpts *opts, const char **iol_opts)
+{
+    int i;
+    uint64_t opt_val = 0;
+    bool iol_flag    = false;
+
+    for (i = 0; iol_opts[i]; i++) {
+        opt_val = qemu_opt_get_number(opts, iol_opts[i], 0);
+        if (opt_val != 0) {
+            iol_flag = true;
+            break;
+        }
+    }
+
+    return iol_flag;
+}
+
 uint64_t qemu_opt_get_size(QemuOpts *opts, const char *name, uint64_t defval)
 {
     QemuOpt *opt = qemu_opt_find(opts, name);
diff --git a/qemu-option.h b/qemu-option.h
index b515813..fc909f9 100644
--- a/qemu-option.h
+++ b/qemu-option.h
@@ -107,6 +107,7 @@ struct QemuOptsList {
 const char *qemu_opt_get(QemuOpts *opts, const char *name);
 int qemu_opt_get_bool(QemuOpts *opts, const char *name, int defval);
 uint64_t qemu_opt_get_number(QemuOpts *opts, const char *name, uint64_t defval);
+bool qemu_opt_io_limits_enable_flag(QemuOpts *opts, const char **iol_opts);
 uint64_t qemu_opt_get_size(QemuOpts *opts, const char *name, uint64_t defval);
 int qemu_opt_set(QemuOpts *opts, const char *name, const char *value);
 typedef int (*qemu_opt_loopfunc)(const char *name, const char *value, void *opaque);
diff --git a/qemu-options.hx b/qemu-options.hx
index cb3347e..ae219f5 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -121,6 +121,7 @@ DEF("drive", HAS_ARG, QEMU_OPTION_drive,
     "       [,cache=writethrough|writeback|none|unsafe][,format=f]\n"
     "       [,serial=s][,addr=A][,id=name][,aio=threads|native]\n"
     "       [,readonly=on|off][,boot=on|off]\n"
+    "       [[,bps=b]|[[,bps_rd=r][,bps_wr=w]]][[,iops=i]|[[,iops_rd=r][,iops_wr=w]]\n"
     "                use 'file' as a drive image\n", QEMU_ARCH_ALL)
 STEXI
 @item -drive @var{option}[,@var{option}[,@var{option}[,...]]]
-- 
1.7.2.3


* [Qemu-devel] [PATCH v5 2/4] block: add the block queue support
  2011-08-09  4:17 [Qemu-devel] [PATCH v5 0/4] The intro of QEMU block I/O throttling Zhi Yong Wu
  2011-08-09  4:17 ` [Qemu-devel] [PATCH v5 1/4] block: add the command line support Zhi Yong Wu
@ 2011-08-09  4:17 ` Zhi Yong Wu
  2011-08-09  8:46   ` Ram Pai
  2011-08-09 12:49   ` Stefan Hajnoczi
  2011-08-09  4:17 ` [Qemu-devel] [PATCH v5 3/4] block: add block timer and block throttling algorithm Zhi Yong Wu
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-09  4:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, pair, stefanha, kvm, mtosatti, Zhi Yong Wu, zwu.kernel,
	ryanh, luowenj

This patch introduces a block queue for the QEMU block layer.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 block/blk-queue.c |  141 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 block/blk-queue.h |   73 +++++++++++++++++++++++++++
 2 files changed, 214 insertions(+), 0 deletions(-)
 create mode 100644 block/blk-queue.c
 create mode 100644 block/blk-queue.h

diff --git a/block/blk-queue.c b/block/blk-queue.c
new file mode 100644
index 0000000..f36f3e2
--- /dev/null
+++ b/block/blk-queue.c
@@ -0,0 +1,141 @@
+/*
+ * QEMU System Emulator queue definition for block layer
+ *
+ * Copyright (c) 2011 Zhi Yong Wu  <zwu.kernel@gmail.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "block_int.h"
+#include "qemu-queue.h"
+#include "block/blk-queue.h"
+
+/* The APIs for block request queue on qemu block layer.
+ */
+
+static void qemu_block_queue_cancel(BlockDriverAIOCB *acb)
+{
+    qemu_aio_release(acb);
+}
+
+static AIOPool block_queue_pool = {
+    .aiocb_size         = sizeof(struct BlockDriverAIOCB),
+    .cancel             = qemu_block_queue_cancel,
+};
+
+static void qemu_block_queue_callback(void *opaque, int ret)
+{
+    BlockDriverAIOCB *acb = opaque;
+
+    qemu_aio_release(acb);
+}
+
+BlockQueue *qemu_new_block_queue(void)
+{
+    BlockQueue *queue;
+
+    queue = qemu_mallocz(sizeof(BlockQueue));
+
+    QTAILQ_INIT(&queue->requests);
+
+    return queue;
+}
+
+void qemu_del_block_queue(BlockQueue *queue)
+{
+    BlockIORequest *request, *next;
+
+    QTAILQ_FOREACH_SAFE(request, &queue->requests, entry, next) {
+        QTAILQ_REMOVE(&queue->requests, request, entry);
+        qemu_free(request);
+    }
+
+    qemu_free(queue);
+}
+
+BlockDriverAIOCB *qemu_block_queue_enqueue(BlockQueue *queue,
+                        BlockDriverState *bs,
+                        BlockRequestHandler *handler,
+                        int64_t sector_num,
+                        QEMUIOVector *qiov,
+                        int nb_sectors,
+                        BlockDriverCompletionFunc *cb,
+                        void *opaque)
+{
+    BlockIORequest *request;
+    BlockDriverAIOCB *acb;
+
+    request = qemu_malloc(sizeof(BlockIORequest));
+    request->bs = bs;
+    request->handler = handler;
+    request->sector_num = sector_num;
+    request->qiov = qiov;
+    request->nb_sectors = nb_sectors;
+    request->cb = cb;
+    request->opaque = opaque;
+
+    QTAILQ_INSERT_TAIL(&queue->requests, request, entry);
+
+    acb = qemu_aio_get(&block_queue_pool, bs,
+                       qemu_block_queue_callback, opaque);
+
+    request->acb = acb;
+
+    return acb;
+}
+
+int qemu_block_queue_handler(BlockIORequest *request)
+{
+    int ret;
+    BlockDriverAIOCB *res;
+
+    /* indicate this req is from block queue */
+    request->bs->req_from_queue = true;
+
+    res = request->handler(request->bs, request->sector_num,
+                           request->qiov, request->nb_sectors,
+                           request->cb, request->opaque);
+
+    if (request->acb) {
+        qemu_block_queue_callback(request->acb, 0);
+    }
+
+    ret = (res == NULL) ? 0 : 1;
+
+    return ret;
+}
+
+void qemu_block_queue_flush(BlockQueue *queue)
+{
+    while (!QTAILQ_EMPTY(&queue->requests)) {
+        BlockIORequest *request = NULL;
+        int ret = 0;
+
+        request = QTAILQ_FIRST(&queue->requests);
+        QTAILQ_REMOVE(&queue->requests, request, entry);
+
+        ret = qemu_block_queue_handler(request);
+        if (ret == 0) {
+            QTAILQ_INSERT_HEAD(&queue->requests, request, entry);
+            break;
+        }
+
+        qemu_free(request);
+    }
+}
diff --git a/block/blk-queue.h b/block/blk-queue.h
new file mode 100644
index 0000000..281b679
--- /dev/null
+++ b/block/blk-queue.h
@@ -0,0 +1,73 @@
+/*
+ * QEMU System Emulator queue declaration for block layer
+ *
+ * Copyright (c) 2011 Zhi Yong Wu  <zwu.kernel@gmail.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#ifndef QEMU_BLOCK_QUEUE_H
+#define QEMU_BLOCK_QUEUE_H
+
+#include "block.h"
+#include "qemu-queue.h"
+#include "qemu-common.h"
+
+typedef BlockDriverAIOCB* (BlockRequestHandler) (BlockDriverState *bs,
+                                int64_t sector_num, QEMUIOVector *qiov,
+                                int nb_sectors, BlockDriverCompletionFunc *cb,
+                                void *opaque);
+
+struct BlockIORequest {
+    QTAILQ_ENTRY(BlockIORequest) entry;
+    BlockDriverState *bs;
+    BlockRequestHandler *handler;
+    int64_t sector_num;
+    QEMUIOVector *qiov;
+    int nb_sectors;
+    BlockDriverCompletionFunc *cb;
+    void *opaque;
+    BlockDriverAIOCB *acb;
+};
+
+typedef struct BlockIORequest BlockIORequest;
+
+struct BlockQueue {
+    QTAILQ_HEAD(requests, BlockIORequest) requests;
+};
+
+typedef struct BlockQueue BlockQueue;
+
+BlockQueue *qemu_new_block_queue(void);
+
+void qemu_del_block_queue(BlockQueue *queue);
+
+BlockDriverAIOCB *qemu_block_queue_enqueue(BlockQueue *queue,
+                        BlockDriverState *bs,
+                        BlockRequestHandler *handler,
+                        int64_t sector_num,
+                        QEMUIOVector *qiov,
+                        int nb_sectors,
+                        BlockDriverCompletionFunc *cb,
+                        void *opaque);
+
+int qemu_block_queue_handler(BlockIORequest *request);
+
+void qemu_block_queue_flush(BlockQueue *queue);
+#endif /* QEMU_BLOCK_QUEUE_H */
-- 
1.7.2.3


* [Qemu-devel] [PATCH v5 3/4] block: add block timer and block throttling algorithm
  2011-08-09  4:17 [Qemu-devel] [PATCH v5 0/4] The intro of QEMU block I/O throttling Zhi Yong Wu
  2011-08-09  4:17 ` [Qemu-devel] [PATCH v5 1/4] block: add the command line support Zhi Yong Wu
  2011-08-09  4:17 ` [Qemu-devel] [PATCH v5 2/4] block: add the block queue support Zhi Yong Wu
@ 2011-08-09  4:17 ` Zhi Yong Wu
  2011-08-09  8:57   ` Ram Pai
  2011-08-09 15:19   ` Stefan Hajnoczi
  2011-08-09  4:17 ` [Qemu-devel] [PATCH v5 4/4] qmp/hmp: add block_set_io_throttle Zhi Yong Wu
  2011-08-09 12:08 ` [Qemu-devel] [PATCH v5 0/4] The intro of QEMU block I/O throttling Stefan Hajnoczi
  4 siblings, 2 replies; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-09  4:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, pair, stefanha, kvm, mtosatti, Zhi Yong Wu, zwu.kernel,
	ryanh, luowenj

Note:
      1.) When a bps/iops limit is set to a very small value such as 511 bytes/s, the VM will hang. We are considering how to handle this scenario.
      2.) When a "dd" command is issued in the guest with a large block size such as "bs=1024K", the resulting speed will slightly exceed the limits.

If you have ideas about these problems, please let us know. :)

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 block.c     |  347 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 block.h     |    6 +-
 block_int.h |   30 +++++
 3 files changed, 372 insertions(+), 11 deletions(-)

diff --git a/block.c b/block.c
index 24a25d5..8fd6643 100644
--- a/block.c
+++ b/block.c
@@ -29,6 +29,9 @@
 #include "module.h"
 #include "qemu-objects.h"
 
+#include "qemu-timer.h"
+#include "block/blk-queue.h"
+
 #ifdef CONFIG_BSD
 #include <sys/types.h>
 #include <sys/stat.h>
@@ -58,6 +61,13 @@ static int bdrv_read_em(BlockDriverState *bs, int64_t sector_num,
 static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num,
                          const uint8_t *buf, int nb_sectors);
 
+static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
+        bool is_write, double elapsed_time, uint64_t *wait);
+static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
+        double elapsed_time, uint64_t *wait);
+static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
+        bool is_write, uint64_t *wait);
+
 static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
     QTAILQ_HEAD_INITIALIZER(bdrv_states);
 
@@ -90,6 +100,68 @@ int is_windows_drive(const char *filename)
 }
 #endif
 
+/* throttling disk I/O limits */
+void bdrv_io_limits_disable(BlockDriverState *bs)
+{
+    bs->io_limits_enabled = false;
+    bs->req_from_queue    = false;
+
+    if (bs->block_queue) {
+        qemu_block_queue_flush(bs->block_queue);
+        qemu_del_block_queue(bs->block_queue);
+    }
+
+    if (bs->block_timer) {
+        qemu_del_timer(bs->block_timer);
+        qemu_free_timer(bs->block_timer);
+    }
+
+    bs->slice_start[0]   = 0;
+    bs->slice_start[1]   = 0;
+
+    bs->slice_end[0]     = 0;
+    bs->slice_end[1]     = 0;
+}
+
+static void bdrv_block_timer(void *opaque)
+{
+    BlockDriverState *bs = opaque;
+    BlockQueue *queue = bs->block_queue;
+
+    qemu_block_queue_flush(queue);
+}
+
+void bdrv_io_limits_enable(BlockDriverState *bs)
+{
+    bs->req_from_queue = false;
+
+    bs->block_queue    = qemu_new_block_queue();
+    bs->block_timer    = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
+
+    bs->slice_start[BLOCK_IO_LIMIT_READ]  = qemu_get_clock_ns(vm_clock);
+    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = qemu_get_clock_ns(vm_clock);
+
+    bs->slice_end[BLOCK_IO_LIMIT_READ]    =
+                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
+    bs->slice_end[BLOCK_IO_LIMIT_WRITE]   =
+                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
+}
+
+bool bdrv_io_limits_enabled(BlockDriverState *bs)
+{
+    BlockIOLimit *io_limits = &bs->io_limits;
+    if ((io_limits->bps[BLOCK_IO_LIMIT_READ] == 0)
+         && (io_limits->bps[BLOCK_IO_LIMIT_WRITE] == 0)
+         && (io_limits->bps[BLOCK_IO_LIMIT_TOTAL] == 0)
+         && (io_limits->iops[BLOCK_IO_LIMIT_READ] == 0)
+         && (io_limits->iops[BLOCK_IO_LIMIT_WRITE] == 0)
+         && (io_limits->iops[BLOCK_IO_LIMIT_TOTAL] == 0)) {
+        return false;
+    }
+
+    return true;
+}
+
 /* check if the path starts with "<protocol>:" */
 static int path_has_protocol(const char *path)
 {
@@ -642,6 +714,11 @@ int bdrv_open(BlockDriverState *bs, const char *filename, int flags,
             bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
     }
 
+    /* throttling disk I/O limits */
+    if (bs->io_limits_enabled) {
+        bdrv_io_limits_enable(bs);
+    }
+
     return 0;
 
 unlink_and_fail:
@@ -680,6 +757,16 @@ void bdrv_close(BlockDriverState *bs)
         if (bs->change_cb)
             bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
     }
+
+    /* throttling disk I/O limits */
+    if (bs->block_queue) {
+        qemu_del_block_queue(bs->block_queue);
+    }
+
+    if (bs->block_timer) {
+        qemu_del_timer(bs->block_timer);
+        qemu_free_timer(bs->block_timer);
+    }
 }
 
 void bdrv_close_all(void)
@@ -1312,6 +1399,15 @@ void bdrv_get_geometry_hint(BlockDriverState *bs,
     *psecs = bs->secs;
 }
 
+/* throttling disk io limits */
+void bdrv_set_io_limits(BlockDriverState *bs,
+                            BlockIOLimit *io_limits)
+{
+    memset(&bs->io_limits, 0, sizeof(BlockIOLimit));
+    bs->io_limits = *io_limits;
+    bs->io_limits_enabled = bdrv_io_limits_enabled(bs);
+}
+
 /* Recognize floppy formats */
 typedef struct FDFormat {
     FDriveType drive;
@@ -1707,6 +1803,16 @@ static void bdrv_print_dict(QObject *obj, void *opaque)
                             qdict_get_bool(qdict, "ro"),
                             qdict_get_str(qdict, "drv"),
                             qdict_get_bool(qdict, "encrypted"));
+
+        monitor_printf(mon, " bps=%" PRId64 " bps_rd=%" PRId64
+                            " bps_wr=%" PRId64 " iops=%" PRId64
+                            " iops_rd=%" PRId64 " iops_wr=%" PRId64,
+                            qdict_get_int(qdict, "bps"),
+                            qdict_get_int(qdict, "bps_rd"),
+                            qdict_get_int(qdict, "bps_wr"),
+                            qdict_get_int(qdict, "iops"),
+                            qdict_get_int(qdict, "iops_rd"),
+                            qdict_get_int(qdict, "iops_wr"));
     } else {
         monitor_printf(mon, " [not inserted]");
     }
@@ -1739,10 +1845,22 @@ void bdrv_info(Monitor *mon, QObject **ret_data)
             QDict *bs_dict = qobject_to_qdict(bs_obj);
 
             obj = qobject_from_jsonf("{ 'file': %s, 'ro': %i, 'drv': %s, "
-                                     "'encrypted': %i }",
+                                     "'encrypted': %i, "
+                                     "'bps': %" PRId64 ","
+                                     "'bps_rd': %" PRId64 ","
+                                     "'bps_wr': %" PRId64 ","
+                                     "'iops': %" PRId64 ","
+                                     "'iops_rd': %" PRId64 ","
+                                     "'iops_wr': %" PRId64 "}",
                                      bs->filename, bs->read_only,
                                      bs->drv->format_name,
-                                     bdrv_is_encrypted(bs));
+                                     bdrv_is_encrypted(bs),
+                                     bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL],
+                                     bs->io_limits.bps[BLOCK_IO_LIMIT_READ],
+                                     bs->io_limits.bps[BLOCK_IO_LIMIT_WRITE],
+                                     bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL],
+                                     bs->io_limits.iops[BLOCK_IO_LIMIT_READ],
+                                     bs->io_limits.iops[BLOCK_IO_LIMIT_WRITE]);
             if (bs->backing_file[0] != '\0') {
                 QDict *qdict = qobject_to_qdict(obj);
                 qdict_put(qdict, "backing_file",
@@ -2111,6 +2229,165 @@ char *bdrv_snapshot_dump(char *buf, int buf_size, QEMUSnapshotInfo *sn)
     return buf;
 }
 
+static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
+                 bool is_write, double elapsed_time, uint64_t *wait) {
+    uint64_t bps_limit = 0;
+    double   bytes_limit, bytes_disp, bytes_res;
+    double   slice_time, wait_time;
+
+    if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
+        bps_limit = bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL];
+    } else if (bs->io_limits.bps[is_write]) {
+        bps_limit = bs->io_limits.bps[is_write];
+    } else {
+        if (wait) {
+            *wait = 0;
+        }
+
+        return false;
+    }
+
+    slice_time = bs->slice_end[is_write] - bs->slice_start[is_write];
+    slice_time /= (BLOCK_IO_SLICE_TIME * 10.0);
+    bytes_limit = bps_limit * slice_time;
+    bytes_disp  = bs->io_disps.bytes[is_write];
+    if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
+        bytes_disp += bs->io_disps.bytes[!is_write];
+    }
+
+    bytes_res   = (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
+
+    if (bytes_disp + bytes_res <= bytes_limit) {
+        if (wait) {
+            *wait = 0;
+        }
+
+        return false;
+    }
+
+    /* Calc approx time to dispatch */
+    wait_time = (bytes_disp + bytes_res) / bps_limit - elapsed_time;
+
+    if (wait) {
+        *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
+    }
+
+    return true;
+}
+
+static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
+                             double elapsed_time, uint64_t *wait) {
+    uint64_t iops_limit = 0;
+    double   ios_limit, ios_disp;
+    double   slice_time, wait_time;
+
+    if (bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
+        iops_limit = bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL];
+    } else if (bs->io_limits.iops[is_write]) {
+        iops_limit = bs->io_limits.iops[is_write];
+    } else {
+        if (wait) {
+            *wait = 0;
+        }
+
+        return false;
+    }
+
+    slice_time = bs->slice_end[is_write] - bs->slice_start[is_write];
+    slice_time /= (BLOCK_IO_SLICE_TIME * 10.0);
+    ios_limit  = iops_limit * slice_time;
+    ios_disp   = bs->io_disps.ios[is_write];
+    if (bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
+        ios_disp += bs->io_disps.ios[!is_write];
+    }
+
+    if (ios_disp + 1 <= ios_limit) {
+        if (wait) {
+            *wait = 0;
+        }
+
+        return false;
+    }
+
+    /* Calc approx time to dispatch */
+    wait_time = (ios_disp + 1) / iops_limit;
+    if (wait_time > elapsed_time) {
+        wait_time = wait_time - elapsed_time;
+    } else {
+        wait_time = 0;
+    }
+
+    if (wait) {
+        *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
+    }
+
+    return true;
+}
+
+static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
+                           bool is_write, uint64_t *wait) {
+    int64_t  real_time, real_slice;
+    uint64_t bps_wait = 0, iops_wait = 0, max_wait;
+    double   elapsed_time;
+    int      bps_ret, iops_ret;
+
+    real_time = qemu_get_clock_ns(vm_clock);
+    real_slice = bs->slice_end[is_write] - bs->slice_start[is_write];
+    if ((bs->slice_start[is_write] < real_time)
+        && (bs->slice_end[is_write] > real_time)) {
+        bs->slice_end[is_write]   = real_time + BLOCK_IO_SLICE_TIME;
+    } else {
+        bs->slice_start[is_write]     = real_time;
+        bs->slice_end[is_write]       = real_time + BLOCK_IO_SLICE_TIME;
+
+        bs->io_disps.bytes[is_write]  = 0;
+        bs->io_disps.bytes[!is_write] = 0;
+
+        bs->io_disps.ios[is_write]    = 0;
+        bs->io_disps.ios[!is_write]   = 0;
+    }
+
+    /* If a limit was exceeded, immediately queue this request */
+    if (!bs->req_from_queue
+        && !QTAILQ_EMPTY(&bs->block_queue->requests)) {
+        if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]
+            || bs->io_limits.bps[is_write] || bs->io_limits.iops[is_write]
+            || bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
+            if (wait) {
+                *wait = 0;
+            }
+
+            return true;
+        }
+    }
+
+    elapsed_time  = real_time - bs->slice_start[is_write];
+    elapsed_time  /= (BLOCK_IO_SLICE_TIME * 10.0);
+
+    bps_ret  = bdrv_exceed_bps_limits(bs, nb_sectors,
+                                      is_write, elapsed_time, &bps_wait);
+    iops_ret = bdrv_exceed_iops_limits(bs, is_write,
+                                      elapsed_time, &iops_wait);
+    if (bps_ret || iops_ret) {
+        max_wait = bps_wait > iops_wait ? bps_wait : iops_wait;
+        if (wait) {
+            *wait = max_wait;
+        }
+
+        real_time = qemu_get_clock_ns(vm_clock);
+        if (bs->slice_end[is_write] < real_time + max_wait) {
+            bs->slice_end[is_write] = real_time + max_wait;
+        }
+
+        return true;
+    }
+
+    if (wait) {
+        *wait = 0;
+    }
+
+    return false;
+}
 
 /**************************************************************/
 /* async I/Os */
@@ -2121,13 +2398,28 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
 {
     BlockDriver *drv = bs->drv;
     BlockDriverAIOCB *ret;
+    uint64_t wait_time = 0;
 
     trace_bdrv_aio_readv(bs, sector_num, nb_sectors, opaque);
 
-    if (!drv)
-        return NULL;
-    if (bdrv_check_request(bs, sector_num, nb_sectors))
+    if (!drv || bdrv_check_request(bs, sector_num, nb_sectors)) {
+        if (bs->io_limits_enabled) {
+            bs->req_from_queue = false;
+        }
         return NULL;
+    }
+
+    /* throttling disk read I/O */
+    if (bs->io_limits_enabled) {
+        if (bdrv_exceed_io_limits(bs, nb_sectors, false, &wait_time)) {
+            ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_readv,
+                              sector_num, qiov, nb_sectors, cb, opaque);
+            qemu_mod_timer(bs->block_timer,
+                       wait_time + qemu_get_clock_ns(vm_clock));
+            bs->req_from_queue = false;
+            return ret;
+        }
+    }
 
     ret = drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors,
                               cb, opaque);
@@ -2136,6 +2428,16 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
 	/* Update stats even though technically transfer has not happened. */
 	bs->rd_bytes += (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
 	bs->rd_ops ++;
+
+        if (bs->io_limits_enabled) {
+            bs->io_disps.bytes[BLOCK_IO_LIMIT_READ] +=
+                              (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
+            bs->io_disps.ios[BLOCK_IO_LIMIT_READ]++;
+        }
+    }
+
+    if (bs->io_limits_enabled) {
+        bs->req_from_queue = false;
     }
 
     return ret;
@@ -2184,15 +2486,18 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
     BlockDriver *drv = bs->drv;
     BlockDriverAIOCB *ret;
     BlockCompleteData *blk_cb_data;
+    uint64_t wait_time = 0;
 
     trace_bdrv_aio_writev(bs, sector_num, nb_sectors, opaque);
 
-    if (!drv)
-        return NULL;
-    if (bs->read_only)
-        return NULL;
-    if (bdrv_check_request(bs, sector_num, nb_sectors))
+    if (!drv || bs->read_only
+        || bdrv_check_request(bs, sector_num, nb_sectors)) {
+        if (bs->io_limits_enabled) {
+            bs->req_from_queue = false;
+        }
+
         return NULL;
+    }
 
     if (bs->dirty_bitmap) {
         blk_cb_data = blk_dirty_cb_alloc(bs, sector_num, nb_sectors, cb,
@@ -2201,6 +2506,18 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
         opaque = blk_cb_data;
     }
 
+    /* throttling disk write I/O */
+    if (bs->io_limits_enabled) {
+        if (bdrv_exceed_io_limits(bs, nb_sectors, true, &wait_time)) {
+            ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_writev,
+                                  sector_num, qiov, nb_sectors, cb, opaque);
+            qemu_mod_timer(bs->block_timer,
+                                  wait_time + qemu_get_clock_ns(vm_clock));
+            bs->req_from_queue = false;
+            return ret;
+        }
+    }
+
     ret = drv->bdrv_aio_writev(bs, sector_num, qiov, nb_sectors,
                                cb, opaque);
 
@@ -2211,6 +2528,16 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
         if (bs->wr_highest_sector < sector_num + nb_sectors - 1) {
             bs->wr_highest_sector = sector_num + nb_sectors - 1;
         }
+
+        if (bs->io_limits_enabled) {
+            bs->io_disps.bytes[BLOCK_IO_LIMIT_WRITE] +=
+                               (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
+            bs->io_disps.ios[BLOCK_IO_LIMIT_WRITE]++;
+        }
+    }
+
+    if (bs->io_limits_enabled) {
+        bs->req_from_queue = false;
     }
 
     return ret;
diff --git a/block.h b/block.h
index 859d1d9..3d02902 100644
--- a/block.h
+++ b/block.h
@@ -57,6 +57,11 @@ void bdrv_info(Monitor *mon, QObject **ret_data);
 void bdrv_stats_print(Monitor *mon, const QObject *data);
 void bdrv_info_stats(Monitor *mon, QObject **ret_data);
 
+/* disk I/O throttling */
+void bdrv_io_limits_enable(BlockDriverState *bs);
+void bdrv_io_limits_disable(BlockDriverState *bs);
+bool bdrv_io_limits_enabled(BlockDriverState *bs);
+
 void bdrv_init(void);
 void bdrv_init_with_whitelist(void);
 BlockDriver *bdrv_find_protocol(const char *filename);
@@ -97,7 +102,6 @@ int bdrv_change_backing_file(BlockDriverState *bs,
     const char *backing_file, const char *backing_fmt);
 void bdrv_register(BlockDriver *bdrv);
 
-
 typedef struct BdrvCheckResult {
     int corruptions;
     int leaks;
diff --git a/block_int.h b/block_int.h
index 1e265d2..a4cd458 100644
--- a/block_int.h
+++ b/block_int.h
@@ -27,10 +27,17 @@
 #include "block.h"
 #include "qemu-option.h"
 #include "qemu-queue.h"
+#include "block/blk-queue.h"
 
 #define BLOCK_FLAG_ENCRYPT	1
 #define BLOCK_FLAG_COMPAT6	4
 
+#define BLOCK_IO_LIMIT_READ     0
+#define BLOCK_IO_LIMIT_WRITE    1
+#define BLOCK_IO_LIMIT_TOTAL    2
+
+#define BLOCK_IO_SLICE_TIME     100000000
+
 #define BLOCK_OPT_SIZE          "size"
 #define BLOCK_OPT_ENCRYPT       "encryption"
 #define BLOCK_OPT_COMPAT6       "compat6"
@@ -46,6 +53,16 @@ typedef struct AIOPool {
     BlockDriverAIOCB *free_aiocb;
 } AIOPool;
 
+typedef struct BlockIOLimit {
+    uint64_t bps[3];
+    uint64_t iops[3];
+} BlockIOLimit;
+
+typedef struct BlockIODisp {
+    uint64_t bytes[2];
+    uint64_t ios[2];
+} BlockIODisp;
+
 struct BlockDriver {
     const char *format_name;
     int instance_size;
@@ -175,6 +192,16 @@ struct BlockDriverState {
 
     void *sync_aiocb;
 
+    /* the time for latest disk I/O */
+    int64_t slice_start[2];
+    int64_t slice_end[2];
+    BlockIOLimit io_limits;
+    BlockIODisp  io_disps;
+    BlockQueue   *block_queue;
+    QEMUTimer    *block_timer;
+    bool         io_limits_enabled;
+    bool         req_from_queue;
+
     /* I/O stats (display with "info blockstats"). */
     uint64_t rd_bytes;
     uint64_t wr_bytes;
@@ -222,6 +249,9 @@ void qemu_aio_release(void *p);
 
 void *qemu_blockalign(BlockDriverState *bs, size_t size);
 
+void bdrv_set_io_limits(BlockDriverState *bs,
+                            BlockIOLimit *io_limits);
+
 #ifdef _WIN32
 int is_windows_drive(const char *filename);
 #endif
-- 
1.7.2.3

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [Qemu-devel] [PATCH v5 4/4] qmp/hmp: add block_set_io_throttle
  2011-08-09  4:17 [Qemu-devel] [PATCH v5 0/4] The intro of QEMU block I/O throttling Zhi Yong Wu
                   ` (2 preceding siblings ...)
  2011-08-09  4:17 ` [Qemu-devel] [PATCH v5 3/4] block: add block timer and block throttling algorithm Zhi Yong Wu
@ 2011-08-09  4:17 ` Zhi Yong Wu
  2011-08-09 12:08 ` [Qemu-devel] [PATCH v5 0/4] The intro of QEMU block I/O throttling Stefan Hajnoczi
  4 siblings, 0 replies; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-09  4:17 UTC (permalink / raw)
  To: qemu-devel
  Cc: kwolf, pair, stefanha, kvm, mtosatti, Zhi Yong Wu, zwu.kernel,
	ryanh, luowenj

The patch introduces one new command, block_set_io_throttle. If you have a better idea for its usage syntax, please let me know.

Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
---
 blockdev.c      |   69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 blockdev.h      |    2 +
 hmp-commands.hx |   15 ++++++++++++
 qerror.c        |    4 +++
 qerror.h        |    3 ++
 qmp-commands.hx |   52 ++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 144 insertions(+), 1 deletions(-)

diff --git a/blockdev.c b/blockdev.c
index 9c78548..ddf7af9 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -782,6 +782,75 @@ int do_change_block(Monitor *mon, const char *device,
     return monitor_read_bdrv_key_start(mon, bs, NULL, NULL);
 }
 
+/* throttling disk I/O limits */
+int do_block_set_io_throttle(Monitor *mon,
+                       const QDict *qdict, QObject **ret_data)
+{
+    const char *devname = qdict_get_str(qdict, "device");
+    uint64_t bps        = qdict_get_try_int(qdict, "bps", -1);
+    uint64_t bps_rd     = qdict_get_try_int(qdict, "bps_rd", -1);
+    uint64_t bps_wr     = qdict_get_try_int(qdict, "bps_wr", -1);
+    uint64_t iops       = qdict_get_try_int(qdict, "iops", -1);
+    uint64_t iops_rd    = qdict_get_try_int(qdict, "iops_rd", -1);
+    uint64_t iops_wr    = qdict_get_try_int(qdict, "iops_wr", -1);
+    BlockDriverState *bs;
+
+    bs = bdrv_find(devname);
+    if (!bs) {
+        qerror_report(QERR_DEVICE_NOT_FOUND, devname);
+        return -1;
+    }
+
+    if ((bps == -1) && (bps_rd == -1) && (bps_wr == -1)
+        && (iops == -1) && (iops_rd == -1) && (iops_wr == -1)) {
+        qerror_report(QERR_MISSING_PARAMETER,
+                      "bps/bps_rd/bps_wr/iops/iops_rd/iops_wr");
+        return -1;
+    }
+
+    if (((bps != -1) && ((bps_rd != -1) || (bps_wr != -1)))
+        || ((iops != -1) && ((iops_rd != -1) || (iops_wr != -1)))) {
+        qerror_report(QERR_INVALID_PARAMETER_COMBINATION);
+        return -1;
+    }
+
+    if (bps != -1) {
+        bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL] = bps;
+        bs->io_limits.bps[BLOCK_IO_LIMIT_READ]  = 0;
+        bs->io_limits.bps[BLOCK_IO_LIMIT_WRITE] = 0;
+    }
+
+    if ((bps_rd != -1) || (bps_wr != -1)) {
+        bs->io_limits.bps[BLOCK_IO_LIMIT_READ]   =
+           (bps_rd == -1) ? bs->io_limits.bps[BLOCK_IO_LIMIT_READ] : bps_rd;
+        bs->io_limits.bps[BLOCK_IO_LIMIT_WRITE]  =
+           (bps_wr == -1) ? bs->io_limits.bps[BLOCK_IO_LIMIT_WRITE] : bps_wr;
+        bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]  = 0;
+    }
+
+    if (iops != -1) {
+        bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL] = iops;
+        bs->io_limits.iops[BLOCK_IO_LIMIT_READ]  = 0;
+        bs->io_limits.iops[BLOCK_IO_LIMIT_WRITE] = 0;
+    }
+
+    if ((iops_rd != -1) || (iops_wr != -1)) {
+        bs->io_limits.iops[BLOCK_IO_LIMIT_READ]  =
+           (iops_rd == -1) ? bs->io_limits.iops[BLOCK_IO_LIMIT_READ] : iops_rd;
+        bs->io_limits.iops[BLOCK_IO_LIMIT_WRITE] =
+           (iops_wr == -1) ? bs->io_limits.iops[BLOCK_IO_LIMIT_WRITE] : iops_wr;
+        bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL] = 0;
+    }
+
+    if (!bs->io_limits_enabled && bdrv_io_limits_enabled(bs)) {
+        bdrv_io_limits_enable(bs);
+    } else if (bs->io_limits_enabled && !bdrv_io_limits_enabled(bs)) {
+        bdrv_io_limits_disable(bs);
+    }
+
+    return 0;
+}
+
 int do_drive_del(Monitor *mon, const QDict *qdict, QObject **ret_data)
 {
     const char *id = qdict_get_str(qdict, "id");
diff --git a/blockdev.h b/blockdev.h
index 0a5144c..d0d0d77 100644
--- a/blockdev.h
+++ b/blockdev.h
@@ -63,6 +63,8 @@ int do_block_set_passwd(Monitor *mon, const QDict *qdict, QObject **ret_data);
 int do_change_block(Monitor *mon, const char *device,
                     const char *filename, const char *fmt);
 int do_drive_del(Monitor *mon, const QDict *qdict, QObject **ret_data);
+int do_block_set_io_throttle(Monitor *mon,
+                    const QDict *qdict, QObject **ret_data);
 int do_snapshot_blkdev(Monitor *mon, const QDict *qdict, QObject **ret_data);
 int do_block_resize(Monitor *mon, const QDict *qdict, QObject **ret_data);
 
diff --git a/hmp-commands.hx b/hmp-commands.hx
index aceba74..3ca496d 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -1210,6 +1210,21 @@ ETEXI
     },
 
 STEXI
+@item block_set_io_throttle @var{device} @var{bps} @var{bps_rd} @var{bps_wr} @var{iops} @var{iops_rd} @var{iops_wr}
+@findex block_set_io_throttle
+Change I/O throttle limits for a block drive to @var{bps} @var{bps_rd} @var{bps_wr} @var{iops} @var{iops_rd} @var{iops_wr}
+ETEXI
+
+    {
+        .name       = "block_set_io_throttle",
+        .args_type  = "device:B,bps:i?,bps_rd:i?,bps_wr:i?,iops:i?,iops_rd:i?,iops_wr:i?",
+        .params     = "device [bps] [bps_rd] [bps_wr] [iops] [iops_rd] [iops_wr]",
+        .help       = "change I/O throttle limits for a block drive",
+        .user_print = monitor_user_noop,
+        .mhandler.cmd_new = do_block_set_io_throttle,
+    },
+
+STEXI
 @item block_passwd @var{device} @var{password}
 @findex block_passwd
 Set the encrypted device @var{device} password to @var{password}
diff --git a/qerror.c b/qerror.c
index d7fcd93..3a21560 100644
--- a/qerror.c
+++ b/qerror.c
@@ -213,6 +213,10 @@ static const QErrorStringTable qerror_table[] = {
         .error_fmt = QERR_VNC_SERVER_FAILED,
         .desc      = "Could not start VNC server on %(target)",
     },
+    {
+        .error_fmt = QERR_INVALID_PARAMETER_COMBINATION,
+        .desc      = "Invalid parameter combination",
+    },
     {}
 };
 
diff --git a/qerror.h b/qerror.h
index 16c830d..81cfe79 100644
--- a/qerror.h
+++ b/qerror.h
@@ -181,4 +181,7 @@ QError *qobject_to_qerror(const QObject *obj);
 #define QERR_FEATURE_DISABLED \
     "{ 'class': 'FeatureDisabled', 'data': { 'name': %s } }"
 
+#define QERR_INVALID_PARAMETER_COMBINATION \
+    "{ 'class': 'InvalidParameterCombination', 'data': {} }"
+
 #endif /* QERROR_H */
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 92c5c3a..9b594cd 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -828,6 +828,44 @@ Example:
 EQMP
 
     {
+        .name       = "block_set_io_throttle",
+        .args_type  = "device:B,bps:i?,bps_rd:i?,bps_wr:i?,iops:i?,iops_rd:i?,iops_wr:i?",
+        .params     = "device [bps] [bps_rd] [bps_wr] [iops] [iops_rd] [iops_wr]",
+        .help       = "change I/O throttle limits for a block drive",
+        .user_print = monitor_user_noop,
+        .mhandler.cmd_new = do_block_set_io_throttle,
+    },
+
+SQMP
+block_set_io_throttle
+---------------------
+
+Change I/O throttle limits for a block drive.
+
+Arguments:
+
+- "device": device name (json-string)
+- "bps": total throughput limit in bytes per second (json-int, optional)
+- "bps_rd": read throughput limit in bytes per second (json-int, optional)
+- "bps_wr": write throughput limit in bytes per second (json-int, optional)
+- "iops": total I/O operations per second (json-int, optional)
+- "iops_rd": read I/O operations per second (json-int, optional)
+- "iops_wr": write I/O operations per second (json-int, optional)
+
+Example:
+
+-> { "execute": "block_set_io_throttle", "arguments": { "device": "virtio0",
+                                               "bps": 1000000,
+                                               "bps_rd": 0,
+                                               "bps_wr": 0,
+                                               "iops": 0,
+                                               "iops_rd": 0,
+                                               "iops_wr": 0 } }
+<- { "return": {} }
+
+EQMP
+
+    {
         .name       = "set_password",
         .args_type  = "protocol:s,password:s,connected:s?",
         .params     = "protocol password action-if-connected",
@@ -1082,6 +1120,12 @@ Each json-object contain the following:
                                 "tftp", "vdi", "vmdk", "vpc", "vvfat"
          - "backing_file": backing file name (json-string, optional)
          - "encrypted": true if encrypted, false otherwise (json-bool)
+         - "bps": limit total bytes per second (json-int)
+         - "bps_rd": limit read bytes per second (json-int)
+         - "bps_wr": limit write bytes per second (json-int)
+         - "iops": limit total I/O operations per second (json-int)
+         - "iops_rd": limit read operations per second (json-int)
+         - "iops_wr": limit write operations per second (json-int)
 
 Example:
 
@@ -1096,7 +1140,13 @@ Example:
                "ro":false,
                "drv":"qcow2",
                "encrypted":false,
-               "file":"disks/test.img"
+               "file":"disks/test.img",
+               "bps":1000000,
+               "bps_rd":0,
+               "bps_wr":0,
+               "iops":1000000,
+               "iops_rd":0,
+               "iops_wr":0,
             },
             "type":"unknown"
          },
-- 
1.7.2.3


* Re: [Qemu-devel] [PATCH v5 2/4] block: add the block queue support
  2011-08-09  4:17 ` [Qemu-devel] [PATCH v5 2/4] block: add the block queue support Zhi Yong Wu
@ 2011-08-09  8:46   ` Ram Pai
  2011-08-09  8:53     ` Zhi Yong Wu
  2011-08-09 12:49   ` Stefan Hajnoczi
  1 sibling, 1 reply; 36+ messages in thread
From: Ram Pai @ 2011-08-09  8:46 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: kwolf, pair, stefanha, kvm, mtosatti, qemu-devel, zwu.kernel,
	ryanh, luowenj

On Tue, Aug 09, 2011 at 12:17:50PM +0800, Zhi Yong Wu wrote:
> The patch introduces one block queue for the QEMU block layer.
> 
> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> ---
>  block/blk-queue.c |  141 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  block/blk-queue.h |   73 +++++++++++++++++++++++++++
>  2 files changed, 214 insertions(+), 0 deletions(-)
>  create mode 100644 block/blk-queue.c
>  create mode 100644 block/blk-queue.h
> 
> diff --git a/block/blk-queue.c b/block/blk-queue.c
> new file mode 100644
> index 0000000..f36f3e2
> --- /dev/null
> +++ b/block/blk-queue.c
> @@ -0,0 +1,141 @@
> +/*
> + * QEMU System Emulator queue definition for block layer
> + *
> + * Copyright (c) 2011 Zhi Yong Wu  <zwu.kernel@gmail.com>
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */
> +
> +#include "block_int.h"
> +#include "qemu-queue.h"
> +#include "block/blk-queue.h"
> +
> +/* The APIs for block request queue on qemu block layer.
> + */
> +
> +static void qemu_block_queue_cancel(BlockDriverAIOCB *acb)
> +{
> +    qemu_aio_release(acb);
> +}
> +
> +static AIOPool block_queue_pool = {
> +    .aiocb_size         = sizeof(struct BlockDriverAIOCB),
> +    .cancel             = qemu_block_queue_cancel,
> +};
> +
> +static void qemu_block_queue_callback(void *opaque, int ret)
> +{
> +    BlockDriverAIOCB *acb = opaque;
> +
> +    qemu_aio_release(acb);
> +}
> +
> +BlockQueue *qemu_new_block_queue(void)
> +{
> +    BlockQueue *queue;
> +
> +    queue = qemu_mallocz(sizeof(BlockQueue));
> +
> +    QTAILQ_INIT(&queue->requests);
> +
> +    return queue;
> +}
> +
> +void qemu_del_block_queue(BlockQueue *queue)
> +{
> +    BlockIORequest *request, *next;
> +
> +    QTAILQ_FOREACH_SAFE(request, &queue->requests, entry, next) {
> +        QTAILQ_REMOVE(&queue->requests, request, entry);
> +        qemu_free(request);
> +    }
> +
> +    qemu_free(queue);
> +}
> +
> +BlockDriverAIOCB *qemu_block_queue_enqueue(BlockQueue *queue,
> +                        BlockDriverState *bs,
> +                        BlockRequestHandler *handler,
> +                        int64_t sector_num,
> +                        QEMUIOVector *qiov,
> +                        int nb_sectors,
> +                        BlockDriverCompletionFunc *cb,
> +                        void *opaque)
> +{
> +    BlockIORequest *request;
> +    BlockDriverAIOCB *acb;
> +
> +    request = qemu_malloc(sizeof(BlockIORequest));
> +    request->bs = bs;
> +    request->handler = handler;
> +    request->sector_num = sector_num;
> +    request->qiov = qiov;
> +    request->nb_sectors = nb_sectors;
> +    request->cb = cb;
> +    request->opaque = opaque;
> +
> +    QTAILQ_INSERT_TAIL(&queue->requests, request, entry);
> +
> +    acb = qemu_aio_get(&block_queue_pool, bs,
> +                       qemu_block_queue_callback, opaque);
> +
> +    request->acb = acb;
> +
> +    return acb;
> +}
> +
> +int qemu_block_queue_handler(BlockIORequest *request)
> +{
> +    int ret;
> +    BlockDriverAIOCB *res;
> +
> +    /* indicate this req is from block queue */
> +    request->bs->req_from_queue = true;
> +
> +    res = request->handler(request->bs, request->sector_num,
> +                           request->qiov, request->nb_sectors,
> +                           request->cb, request->opaque);


Shouldn't you return with a failure here if res == NULL?
Otherwise qemu_block_queue_callback() gets called, which
will release the acb.


> +
> +    if (request->acb) {
> +        qemu_block_queue_callback(request->acb, 0);
> +    }
> +
> +    ret = (res == NULL) ? 0 : 1;
> +
> +    return ret;
> +}


RP


* Re: [Qemu-devel] [PATCH v5 2/4] block: add the block queue support
  2011-08-09  8:46   ` Ram Pai
@ 2011-08-09  8:53     ` Zhi Yong Wu
  0 siblings, 0 replies; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-09  8:53 UTC (permalink / raw)
  To: Ram Pai
  Cc: kwolf, pair, stefanha, kvm, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Tue, Aug 9, 2011 at 4:46 PM, Ram Pai <linuxram@us.ibm.com> wrote:
> On Tue, Aug 09, 2011 at 12:17:50PM +0800, Zhi Yong Wu wrote:
>> The patch introduces one block queue for the QEMU block layer.
>>
>> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>> ---
>>  block/blk-queue.c |  141 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  block/blk-queue.h |   73 +++++++++++++++++++++++++++
>>  2 files changed, 214 insertions(+), 0 deletions(-)
>>  create mode 100644 block/blk-queue.c
>>  create mode 100644 block/blk-queue.h
>>
>> diff --git a/block/blk-queue.c b/block/blk-queue.c
>> new file mode 100644
>> index 0000000..f36f3e2
>> --- /dev/null
>> +++ b/block/blk-queue.c
>> @@ -0,0 +1,141 @@
>> +/*
>> + * QEMU System Emulator queue definition for block layer
>> + *
>> + * Copyright (c) 2011 Zhi Yong Wu  <zwu.kernel@gmail.com>
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>> + * of this software and associated documentation files (the "Software"), to deal
>> + * in the Software without restriction, including without limitation the rights
>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>> + * copies of the Software, and to permit persons to whom the Software is
>> + * furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice shall be included in
>> + * all copies or substantial portions of the Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>> + * THE SOFTWARE.
>> + */
>> +
>> +#include "block_int.h"
>> +#include "qemu-queue.h"
>> +#include "block/blk-queue.h"
>> +
>> +/* The APIs for block request queue on qemu block layer.
>> + */
>> +
>> +static void qemu_block_queue_cancel(BlockDriverAIOCB *acb)
>> +{
>> +    qemu_aio_release(acb);
>> +}
>> +
>> +static AIOPool block_queue_pool = {
>> +    .aiocb_size         = sizeof(struct BlockDriverAIOCB),
>> +    .cancel             = qemu_block_queue_cancel,
>> +};
>> +
>> +static void qemu_block_queue_callback(void *opaque, int ret)
>> +{
>> +    BlockDriverAIOCB *acb = opaque;
>> +
>> +    qemu_aio_release(acb);
>> +}
>> +
>> +BlockQueue *qemu_new_block_queue(void)
>> +{
>> +    BlockQueue *queue;
>> +
>> +    queue = qemu_mallocz(sizeof(BlockQueue));
>> +
>> +    QTAILQ_INIT(&queue->requests);
>> +
>> +    return queue;
>> +}
>> +
>> +void qemu_del_block_queue(BlockQueue *queue)
>> +{
>> +    BlockIORequest *request, *next;
>> +
>> +    QTAILQ_FOREACH_SAFE(request, &queue->requests, entry, next) {
>> +        QTAILQ_REMOVE(&queue->requests, request, entry);
>> +        qemu_free(request);
>> +    }
>> +
>> +    qemu_free(queue);
>> +}
>> +
>> +BlockDriverAIOCB *qemu_block_queue_enqueue(BlockQueue *queue,
>> +                        BlockDriverState *bs,
>> +                        BlockRequestHandler *handler,
>> +                        int64_t sector_num,
>> +                        QEMUIOVector *qiov,
>> +                        int nb_sectors,
>> +                        BlockDriverCompletionFunc *cb,
>> +                        void *opaque)
>> +{
>> +    BlockIORequest *request;
>> +    BlockDriverAIOCB *acb;
>> +
>> +    request = qemu_malloc(sizeof(BlockIORequest));
>> +    request->bs = bs;
>> +    request->handler = handler;
>> +    request->sector_num = sector_num;
>> +    request->qiov = qiov;
>> +    request->nb_sectors = nb_sectors;
>> +    request->cb = cb;
>> +    request->opaque = opaque;
>> +
>> +    QTAILQ_INSERT_TAIL(&queue->requests, request, entry);
>> +
>> +    acb = qemu_aio_get(&block_queue_pool, bs,
>> +                       qemu_block_queue_callback, opaque);
>> +
>> +    request->acb = acb;
>> +
>> +    return acb;
>> +}
>> +
>> +int qemu_block_queue_handler(BlockIORequest *request)
>> +{
>> +    int ret;
>> +    BlockDriverAIOCB *res;
>> +
>> +    /* indicate this req is from block queue */
>> +    request->bs->req_from_queue = true;
>> +
>> +    res = request->handler(request->bs, request->sector_num,
>> +                           request->qiov, request->nb_sectors,
>> +                           request->cb, request->opaque);
>
>
> should'nt you return with a failure here if res == NULL?
> Otherwise  qemu_block_queue_callback() gets called which
> will release the acb.
Thanks, Pai. When the request handler fails, the acb should not be
released. A condition check should be done before the callback
function is invoked. It should look like this:
    res = request->handler(request->bs, request->sector_num,
                           request->qiov, request->nb_sectors,
                           request->cb, request->opaque);

    ret = (res == NULL) ? 0 : 1;

    if (ret && request->acb) {
        qemu_block_queue_callback(request->acb, 0);
    }

>
>
>> +
>> +    if (request->acb) {
>> +        qemu_block_queue_callback(request->acb, 0);
>> +    }
>> +
>> +    ret = (res == NULL) ? 0 : 1;
>> +
>> +    return ret;
>> +}
>
>
> RP
>
>



-- 
Regards,

Zhi Yong Wu


* Re: [Qemu-devel] [PATCH v5 3/4] block: add block timer and block throttling algorithm
  2011-08-09  4:17 ` [Qemu-devel] [PATCH v5 3/4] block: add block timer and block throttling algorithm Zhi Yong Wu
@ 2011-08-09  8:57   ` Ram Pai
  2011-08-09  9:06     ` Zhi Yong Wu
  2011-08-12  5:35     ` Zhi Yong Wu
  2011-08-09 15:19   ` Stefan Hajnoczi
  1 sibling, 2 replies; 36+ messages in thread
From: Ram Pai @ 2011-08-09  8:57 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: kwolf, pair, stefanha, kvm, mtosatti, qemu-devel, zwu.kernel,
	ryanh, luowenj

On Tue, Aug 09, 2011 at 12:17:51PM +0800, Zhi Yong Wu wrote:
> Note:
> >       1.) When bps/iops limits are specified to a small value such as 511 bytes/s, this VM will hang up. We are considering how to handle this scenario.
> >       2.) When the "dd" command is issued in the guest, if its bs option is set to a large value such as "bs=1024K", the resulting speed will be slightly higher than the limits.
> 
> For these problems, if you have nice thought, pls let us know.:)
> 
> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> ---
>  block.c     |  347 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  block.h     |    6 +-
>  block_int.h |   30 +++++
>  3 files changed, 372 insertions(+), 11 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 24a25d5..8fd6643 100644
> --- a/block.c
> +++ b/block.c
> @@ -29,6 +29,9 @@
>  #include "module.h"
>  #include "qemu-objects.h"
> 
> +#include "qemu-timer.h"
> +#include "block/blk-queue.h"
> +
>  #ifdef CONFIG_BSD
>  #include <sys/types.h>
>  #include <sys/stat.h>
> @@ -58,6 +61,13 @@ static int bdrv_read_em(BlockDriverState *bs, int64_t sector_num,
>  static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num,
>                           const uint8_t *buf, int nb_sectors);
> 
> +static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
> +        bool is_write, double elapsed_time, uint64_t *wait);
> +static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
> +        double elapsed_time, uint64_t *wait);
> +static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
> +        bool is_write, uint64_t *wait);
> +
>  static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
>      QTAILQ_HEAD_INITIALIZER(bdrv_states);
> 
> @@ -90,6 +100,68 @@ int is_windows_drive(const char *filename)
>  }
>  #endif
> 
> +/* throttling disk I/O limits */
> +void bdrv_io_limits_disable(BlockDriverState *bs)
> +{
> +    bs->io_limits_enabled = false;
> +    bs->req_from_queue    = false;
> +
> +    if (bs->block_queue) {
> +        qemu_block_queue_flush(bs->block_queue);
> +        qemu_del_block_queue(bs->block_queue);
> +    }
> +
> +    if (bs->block_timer) {
> +        qemu_del_timer(bs->block_timer);
> +        qemu_free_timer(bs->block_timer);
> +    }
> +
> +    bs->slice_start[0]   = 0;
> +    bs->slice_start[1]   = 0;
> +
> +    bs->slice_end[0]     = 0;
> +    bs->slice_end[1]     = 0;
> +}
> +
> +static void bdrv_block_timer(void *opaque)
> +{
> +    BlockDriverState *bs = opaque;
> +    BlockQueue *queue = bs->block_queue;
> +
> +    qemu_block_queue_flush(queue);
> +}
> +
> +void bdrv_io_limits_enable(BlockDriverState *bs)
> +{
> +    bs->req_from_queue = false;
> +
> +    bs->block_queue    = qemu_new_block_queue();
> +    bs->block_timer    = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
> +
> +    bs->slice_start[BLOCK_IO_LIMIT_READ]  = qemu_get_clock_ns(vm_clock);
> +    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = qemu_get_clock_ns(vm_clock);
 
A minor comment: better to keep the slice_start of both the READ and
WRITE sides the same:

    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = bs->slice_start[BLOCK_IO_LIMIT_READ];

This saves a call to qemu_get_clock_ns().

> +
> +    bs->slice_end[BLOCK_IO_LIMIT_READ]    =
> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;

    bs->slice_end[BLOCK_IO_LIMIT_READ] = bs->slice_start[BLOCK_IO_LIMIT_READ] +
			BLOCK_IO_SLICE_TIME;

saves one more call to qemu_get_clock_ns()

> +    bs->slice_end[BLOCK_IO_LIMIT_WRITE]   =
> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;


    bs->slice_end[BLOCK_IO_LIMIT_WRITE] = bs->slice_start[BLOCK_IO_LIMIT_WRITE] +
			BLOCK_IO_SLICE_TIME;

yet another call saved.


> +}
> +
> +bool bdrv_io_limits_enabled(BlockDriverState *bs)
> +{
> +    BlockIOLimit *io_limits = &bs->io_limits;
> +    if ((io_limits->bps[BLOCK_IO_LIMIT_READ] == 0)
> +         && (io_limits->bps[BLOCK_IO_LIMIT_WRITE] == 0)
> +         && (io_limits->bps[BLOCK_IO_LIMIT_TOTAL] == 0)
> +         && (io_limits->iops[BLOCK_IO_LIMIT_READ] == 0)
> +         && (io_limits->iops[BLOCK_IO_LIMIT_WRITE] == 0)
> +         && (io_limits->iops[BLOCK_IO_LIMIT_TOTAL] == 0)) {
> +        return false;
> +    }
> +
> +    return true;
> +}

can be optimized to:

	return (io_limits->bps[BLOCK_IO_LIMIT_READ]
		|| io_limits->bps[BLOCK_IO_LIMIT_WRITE]
		|| io_limits->bps[BLOCK_IO_LIMIT_TOTAL]
		|| io_limits->iops[BLOCK_IO_LIMIT_READ]
		|| io_limits->iops[BLOCK_IO_LIMIT_WRITE]
		|| io_limits->iops[BLOCK_IO_LIMIT_TOTAL]);

> +
>  /* check if the path starts with "<protocol>:" */
>  static int path_has_protocol(const char *path)
>  {
> @@ -642,6 +714,11 @@ int bdrv_open(BlockDriverState *bs, const char *filename, int flags,
>              bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
>      }
> 
> +    /* throttling disk I/O limits */
> +    if (bs->io_limits_enabled) {
> +        bdrv_io_limits_enable(bs);
> +    }
> +
>      return 0;
> 
>  unlink_and_fail:
> @@ -680,6 +757,16 @@ void bdrv_close(BlockDriverState *bs)
>          if (bs->change_cb)
>              bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
>      }
> +
> +    /* throttling disk I/O limits */
> +    if (bs->block_queue) {
> +        qemu_del_block_queue(bs->block_queue);
> +    }
> +
> +    if (bs->block_timer) {
> +        qemu_del_timer(bs->block_timer);
> +        qemu_free_timer(bs->block_timer);
> +    }
>  }
> 
>  void bdrv_close_all(void)
> @@ -1312,6 +1399,15 @@ void bdrv_get_geometry_hint(BlockDriverState *bs,
>      *psecs = bs->secs;
>  }
> 
> +/* throttling disk io limits */
> +void bdrv_set_io_limits(BlockDriverState *bs,
> +                            BlockIOLimit *io_limits)
> +{
> +    memset(&bs->io_limits, 0, sizeof(BlockIOLimit));
> +    bs->io_limits = *io_limits;
> +    bs->io_limits_enabled = bdrv_io_limits_enabled(bs);
> +}
> +
>  /* Recognize floppy formats */
>  typedef struct FDFormat {
>      FDriveType drive;
> @@ -1707,6 +1803,16 @@ static void bdrv_print_dict(QObject *obj, void *opaque)
>                              qdict_get_bool(qdict, "ro"),
>                              qdict_get_str(qdict, "drv"),
>                              qdict_get_bool(qdict, "encrypted"));
> +
> +        monitor_printf(mon, " bps=%" PRId64 " bps_rd=%" PRId64
> +                            " bps_wr=%" PRId64 " iops=%" PRId64
> +                            " iops_rd=%" PRId64 " iops_wr=%" PRId64,
> +                            qdict_get_int(qdict, "bps"),
> +                            qdict_get_int(qdict, "bps_rd"),
> +                            qdict_get_int(qdict, "bps_wr"),
> +                            qdict_get_int(qdict, "iops"),
> +                            qdict_get_int(qdict, "iops_rd"),
> +                            qdict_get_int(qdict, "iops_wr"));
>      } else {
>          monitor_printf(mon, " [not inserted]");
>      }
> @@ -1739,10 +1845,22 @@ void bdrv_info(Monitor *mon, QObject **ret_data)
>              QDict *bs_dict = qobject_to_qdict(bs_obj);
> 
>              obj = qobject_from_jsonf("{ 'file': %s, 'ro': %i, 'drv': %s, "
> -                                     "'encrypted': %i }",
> +                                     "'encrypted': %i, "
> +                                     "'bps': %" PRId64 ","
> +                                     "'bps_rd': %" PRId64 ","
> +                                     "'bps_wr': %" PRId64 ","
> +                                     "'iops': %" PRId64 ","
> +                                     "'iops_rd': %" PRId64 ","
> +                                     "'iops_wr': %" PRId64 "}",
>                                       bs->filename, bs->read_only,
>                                       bs->drv->format_name,
> -                                     bdrv_is_encrypted(bs));
> +                                     bdrv_is_encrypted(bs),
> +                                     bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL],
> +                                     bs->io_limits.bps[BLOCK_IO_LIMIT_READ],
> +                                     bs->io_limits.bps[BLOCK_IO_LIMIT_WRITE],
> +                                     bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL],
> +                                     bs->io_limits.iops[BLOCK_IO_LIMIT_READ],
> +                                     bs->io_limits.iops[BLOCK_IO_LIMIT_WRITE]);
>              if (bs->backing_file[0] != '\0') {
>                  QDict *qdict = qobject_to_qdict(obj);
>                  qdict_put(qdict, "backing_file",
> @@ -2111,6 +2229,165 @@ char *bdrv_snapshot_dump(char *buf, int buf_size, QEMUSnapshotInfo *sn)
>      return buf;
>  }
> 
> +static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
> +                 bool is_write, double elapsed_time, uint64_t *wait) {
> +    uint64_t bps_limit = 0;
> +    double   bytes_limit, bytes_disp, bytes_res;
> +    double   slice_time, wait_time;
> +
> +    if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
> +        bps_limit = bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL];
> +    } else if (bs->io_limits.bps[is_write]) {
> +        bps_limit = bs->io_limits.bps[is_write];
> +    } else {
> +        if (wait) {
> +            *wait = 0;
> +        }
> +
> +        return false;
> +    }
> +
> +    slice_time = bs->slice_end[is_write] - bs->slice_start[is_write];
> +    slice_time /= (BLOCK_IO_SLICE_TIME * 10.0);
> +    bytes_limit = bps_limit * slice_time;
> +    bytes_disp  = bs->io_disps.bytes[is_write];
> +    if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
> +        bytes_disp += bs->io_disps.bytes[!is_write];
> +    }
> +
> +    bytes_res   = (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
> +
> +    if (bytes_disp + bytes_res <= bytes_limit) {
> +        if (wait) {
> +            *wait = 0;
> +        }
> +
> +        return false;
> +    }
> +
> +    /* Calc approx time to dispatch */
> +    wait_time = (bytes_disp + bytes_res) / bps_limit - elapsed_time;
> +
> +    if (wait) {
> +        *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
> +    }
> +
> +    return true;
> +}
> +
> +static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
> +                             double elapsed_time, uint64_t *wait) {
> +    uint64_t iops_limit = 0;
> +    double   ios_limit, ios_disp;
> +    double   slice_time, wait_time;
> +
> +    if (bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
> +        iops_limit = bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL];
> +    } else if (bs->io_limits.iops[is_write]) {
> +        iops_limit = bs->io_limits.iops[is_write];
> +    } else {
> +        if (wait) {
> +            *wait = 0;
> +        }
> +
> +        return false;
> +    }
> +
> +    slice_time = bs->slice_end[is_write] - bs->slice_start[is_write];
> +    slice_time /= (BLOCK_IO_SLICE_TIME * 10.0);
> +    ios_limit  = iops_limit * slice_time;
> +    ios_disp   = bs->io_disps.ios[is_write];
> +    if (bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
> +        ios_disp += bs->io_disps.ios[!is_write];
> +    }
> +
> +    if (ios_disp + 1 <= ios_limit) {
> +        if (wait) {
> +            *wait = 0;
> +        }
> +
> +        return false;
> +    }
> +
> +    /* Calc approx time to dispatch */
> +    wait_time = (ios_disp + 1) / iops_limit;
> +    if (wait_time > elapsed_time) {
> +        wait_time = wait_time - elapsed_time;
> +    } else {
> +        wait_time = 0;
> +    }
> +
> +    if (wait) {
> +        *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
> +    }
> +
> +    return true;
> +}

bdrv_exceed_bps_limits() and bdrv_exceed_iops_limits() have almost identical logic.
They could probably be abstracted into a single function.

RP

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH v5 3/4] block: add block timer and block throttling algorithm
  2011-08-09  8:57   ` Ram Pai
@ 2011-08-09  9:06     ` Zhi Yong Wu
  2011-08-12  5:35     ` Zhi Yong Wu
  1 sibling, 0 replies; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-09  9:06 UTC (permalink / raw)
  To: Ram Pai
  Cc: kwolf, pair, stefanha, kvm, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Tue, Aug 9, 2011 at 4:57 PM, Ram Pai <linuxram@us.ibm.com> wrote:
> On Tue, Aug 09, 2011 at 12:17:51PM +0800, Zhi Yong Wu wrote:
>> Note:
>>       1.) When bps/iops limits are specified to a small value such as 511 bytes/s, this VM will hang up. We are considering how to handle this senario.
>>       2.) When "dd" command is issued in guest, if its option bs is set to a large value such as "bs=1024K", the result speed will slightly bigger than the limits.
>>
>> For these problems, if you have nice thought, pls let us know.:)
>>
>> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>> ---
>>  block.c     |  347 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>  block.h     |    6 +-
>>  block_int.h |   30 +++++
>>  3 files changed, 372 insertions(+), 11 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index 24a25d5..8fd6643 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -29,6 +29,9 @@
>>  #include "module.h"
>>  #include "qemu-objects.h"
>>
>> +#include "qemu-timer.h"
>> +#include "block/blk-queue.h"
>> +
>>  #ifdef CONFIG_BSD
>>  #include <sys/types.h>
>>  #include <sys/stat.h>
>> @@ -58,6 +61,13 @@ static int bdrv_read_em(BlockDriverState *bs, int64_t sector_num,
>>  static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num,
>>                           const uint8_t *buf, int nb_sectors);
>>
>> +static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
>> +        bool is_write, double elapsed_time, uint64_t *wait);
>> +static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
>> +        double elapsed_time, uint64_t *wait);
>> +static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
>> +        bool is_write, uint64_t *wait);
>> +
>>  static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
>>      QTAILQ_HEAD_INITIALIZER(bdrv_states);
>>
>> @@ -90,6 +100,68 @@ int is_windows_drive(const char *filename)
>>  }
>>  #endif
>>
>> +/* throttling disk I/O limits */
>> +void bdrv_io_limits_disable(BlockDriverState *bs)
>> +{
>> +    bs->io_limits_enabled = false;
>> +    bs->req_from_queue    = false;
>> +
>> +    if (bs->block_queue) {
>> +        qemu_block_queue_flush(bs->block_queue);
>> +        qemu_del_block_queue(bs->block_queue);
>> +    }
>> +
>> +    if (bs->block_timer) {
>> +        qemu_del_timer(bs->block_timer);
>> +        qemu_free_timer(bs->block_timer);
>> +    }
>> +
>> +    bs->slice_start[0]   = 0;
>> +    bs->slice_start[1]   = 0;
>> +
>> +    bs->slice_end[0]     = 0;
>> +    bs->slice_end[1]     = 0;
>> +}
>> +
>> +static void bdrv_block_timer(void *opaque)
>> +{
>> +    BlockDriverState *bs = opaque;
>> +    BlockQueue *queue = bs->block_queue;
>> +
>> +    qemu_block_queue_flush(queue);
>> +}
>> +
>> +void bdrv_io_limits_enable(BlockDriverState *bs)
>> +{
>> +    bs->req_from_queue = false;
>> +
>> +    bs->block_queue    = qemu_new_block_queue();
>> +    bs->block_timer    = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
>> +
>> +    bs->slice_start[BLOCK_IO_LIMIT_READ]  = qemu_get_clock_ns(vm_clock);
>> +    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = qemu_get_clock_ns(vm_clock);
>
> A minor comment: better to keep the slice_start of both the READ and WRITE
> sides the same.
>
>    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = bs->slice_start[BLOCK_IO_LIMIT_READ];
>
> saves a call to qemu_get_clock_ns().
>
>> +
>> +    bs->slice_end[BLOCK_IO_LIMIT_READ]    =
>> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>
>    bs->slice_end[BLOCK_IO_LIMIT_READ] = bs->slice_start[BLOCK_IO_LIMIT_READ] +
>                        BLOCK_IO_SLICE_TIME;
>
> saves one more call to qemu_get_clock_ns()
>
>> +    bs->slice_end[BLOCK_IO_LIMIT_WRITE]   =
>> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>
>
>    bs->slice_end[BLOCK_IO_LIMIT_WRITE] = bs->slice_start[BLOCK_IO_LIMIT_WRITE] +
>                        BLOCK_IO_SLICE_TIME;
>
> yet another call saving.
OK. thanks.
>
>
>> +}
>> +
>> +bool bdrv_io_limits_enabled(BlockDriverState *bs)
>> +{
>> +    BlockIOLimit *io_limits = &bs->io_limits;
>> +    if ((io_limits->bps[BLOCK_IO_LIMIT_READ] == 0)
>> +         && (io_limits->bps[BLOCK_IO_LIMIT_WRITE] == 0)
>> +         && (io_limits->bps[BLOCK_IO_LIMIT_TOTAL] == 0)
>> +         && (io_limits->iops[BLOCK_IO_LIMIT_READ] == 0)
>> +         && (io_limits->iops[BLOCK_IO_LIMIT_WRITE] == 0)
>> +         && (io_limits->iops[BLOCK_IO_LIMIT_TOTAL] == 0)) {
>> +        return false;
>> +    }
>> +
>> +    return true;
>> +}
>
> can be optimized to:
>
>        return (io_limits->bps[BLOCK_IO_LIMIT_READ]
>                || io_limits->bps[BLOCK_IO_LIMIT_WRITE]
>                || io_limits->bps[BLOCK_IO_LIMIT_TOTAL]
>                || io_limits->iops[BLOCK_IO_LIMIT_READ]
>                || io_limits->iops[BLOCK_IO_LIMIT_WRITE]
>                || io_limits->iops[BLOCK_IO_LIMIT_TOTAL]);
OK. thanks.
>
>> +
>>  /* check if the path starts with "<protocol>:" */
>>  static int path_has_protocol(const char *path)
>>  {
>> @@ -642,6 +714,11 @@ int bdrv_open(BlockDriverState *bs, const char *filename, int flags,
>>              bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
>>      }
>>
>> +    /* throttling disk I/O limits */
>> +    if (bs->io_limits_enabled) {
>> +        bdrv_io_limits_enable(bs);
>> +    }
>> +
>>      return 0;
>>
>>  unlink_and_fail:
>> @@ -680,6 +757,16 @@ void bdrv_close(BlockDriverState *bs)
>>          if (bs->change_cb)
>>              bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
>>      }
>> +
>> +    /* throttling disk I/O limits */
>> +    if (bs->block_queue) {
>> +        qemu_del_block_queue(bs->block_queue);
>> +    }
>> +
>> +    if (bs->block_timer) {
>> +        qemu_del_timer(bs->block_timer);
>> +        qemu_free_timer(bs->block_timer);
>> +    }
>>  }
>>
>>  void bdrv_close_all(void)
>> @@ -1312,6 +1399,15 @@ void bdrv_get_geometry_hint(BlockDriverState *bs,
>>      *psecs = bs->secs;
>>  }
>>
>> +/* throttling disk io limits */
>> +void bdrv_set_io_limits(BlockDriverState *bs,
>> +                            BlockIOLimit *io_limits)
>> +{
>> +    memset(&bs->io_limits, 0, sizeof(BlockIOLimit));
>> +    bs->io_limits = *io_limits;
>> +    bs->io_limits_enabled = bdrv_io_limits_enabled(bs);
>> +}
>> +
>>  /* Recognize floppy formats */
>>  typedef struct FDFormat {
>>      FDriveType drive;
>> @@ -1707,6 +1803,16 @@ static void bdrv_print_dict(QObject *obj, void *opaque)
>>                              qdict_get_bool(qdict, "ro"),
>>                              qdict_get_str(qdict, "drv"),
>>                              qdict_get_bool(qdict, "encrypted"));
>> +
>> +        monitor_printf(mon, " bps=%" PRId64 " bps_rd=%" PRId64
>> +                            " bps_wr=%" PRId64 " iops=%" PRId64
>> +                            " iops_rd=%" PRId64 " iops_wr=%" PRId64,
>> +                            qdict_get_int(qdict, "bps"),
>> +                            qdict_get_int(qdict, "bps_rd"),
>> +                            qdict_get_int(qdict, "bps_wr"),
>> +                            qdict_get_int(qdict, "iops"),
>> +                            qdict_get_int(qdict, "iops_rd"),
>> +                            qdict_get_int(qdict, "iops_wr"));
>>      } else {
>>          monitor_printf(mon, " [not inserted]");
>>      }
>> @@ -1739,10 +1845,22 @@ void bdrv_info(Monitor *mon, QObject **ret_data)
>>              QDict *bs_dict = qobject_to_qdict(bs_obj);
>>
>>              obj = qobject_from_jsonf("{ 'file': %s, 'ro': %i, 'drv': %s, "
>> -                                     "'encrypted': %i }",
>> +                                     "'encrypted': %i, "
>> +                                     "'bps': %" PRId64 ","
>> +                                     "'bps_rd': %" PRId64 ","
>> +                                     "'bps_wr': %" PRId64 ","
>> +                                     "'iops': %" PRId64 ","
>> +                                     "'iops_rd': %" PRId64 ","
>> +                                     "'iops_wr': %" PRId64 "}",
>>                                       bs->filename, bs->read_only,
>>                                       bs->drv->format_name,
>> -                                     bdrv_is_encrypted(bs));
>> +                                     bdrv_is_encrypted(bs),
>> +                                     bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL],
>> +                                     bs->io_limits.bps[BLOCK_IO_LIMIT_READ],
>> +                                     bs->io_limits.bps[BLOCK_IO_LIMIT_WRITE],
>> +                                     bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL],
>> +                                     bs->io_limits.iops[BLOCK_IO_LIMIT_READ],
>> +                                     bs->io_limits.iops[BLOCK_IO_LIMIT_WRITE]);
>>              if (bs->backing_file[0] != '\0') {
>>                  QDict *qdict = qobject_to_qdict(obj);
>>                  qdict_put(qdict, "backing_file",
>> @@ -2111,6 +2229,165 @@ char *bdrv_snapshot_dump(char *buf, int buf_size, QEMUSnapshotInfo *sn)
>>      return buf;
>>  }
>>
>> +static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
>> +                 bool is_write, double elapsed_time, uint64_t *wait) {
>> +    uint64_t bps_limit = 0;
>> +    double   bytes_limit, bytes_disp, bytes_res;
>> +    double   slice_time, wait_time;
>> +
>> +    if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
>> +        bps_limit = bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL];
>> +    } else if (bs->io_limits.bps[is_write]) {
>> +        bps_limit = bs->io_limits.bps[is_write];
>> +    } else {
>> +        if (wait) {
>> +            *wait = 0;
>> +        }
>> +
>> +        return false;
>> +    }
>> +
>> +    slice_time = bs->slice_end[is_write] - bs->slice_start[is_write];
>> +    slice_time /= (BLOCK_IO_SLICE_TIME * 10.0);
>> +    bytes_limit = bps_limit * slice_time;
>> +    bytes_disp  = bs->io_disps.bytes[is_write];
>> +    if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
>> +        bytes_disp += bs->io_disps.bytes[!is_write];
>> +    }
>> +
>> +    bytes_res   = (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
>> +
>> +    if (bytes_disp + bytes_res <= bytes_limit) {
>> +        if (wait) {
>> +            *wait = 0;
>> +        }
>> +
>> +        return false;
>> +    }
>> +
>> +    /* Calc approx time to dispatch */
>> +    wait_time = (bytes_disp + bytes_res) / bps_limit - elapsed_time;
>> +
>> +    if (wait) {
>> +        *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
>> +                             double elapsed_time, uint64_t *wait) {
>> +    uint64_t iops_limit = 0;
>> +    double   ios_limit, ios_disp;
>> +    double   slice_time, wait_time;
>> +
>> +    if (bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
>> +        iops_limit = bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL];
>> +    } else if (bs->io_limits.iops[is_write]) {
>> +        iops_limit = bs->io_limits.iops[is_write];
>> +    } else {
>> +        if (wait) {
>> +            *wait = 0;
>> +        }
>> +
>> +        return false;
>> +    }
>> +
>> +    slice_time = bs->slice_end[is_write] - bs->slice_start[is_write];
>> +    slice_time /= (BLOCK_IO_SLICE_TIME * 10.0);
>> +    ios_limit  = iops_limit * slice_time;
>> +    ios_disp   = bs->io_disps.ios[is_write];
>> +    if (bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
>> +        ios_disp += bs->io_disps.ios[!is_write];
>> +    }
>> +
>> +    if (ios_disp + 1 <= ios_limit) {
>> +        if (wait) {
>> +            *wait = 0;
>> +        }
>> +
>> +        return false;
>> +    }
>> +
>> +    /* Calc approx time to dispatch */
>> +    wait_time = (ios_disp + 1) / iops_limit;
>> +    if (wait_time > elapsed_time) {
>> +        wait_time = wait_time - elapsed_time;
>> +    } else {
>> +        wait_time = 0;
>> +    }
>> +
>> +    if (wait) {
>> +        *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
>> +    }
>> +
>> +    return true;
>> +}
>
> bdrv_exceed_bps_limits() and bdrv_exceed_iops_limits() have almost identical logic.
> They could probably be abstracted into a single function.
I did attempt to consolidate them, but found that it would not save LOC
and would hurt the readability of the code, so I gave up.
>
> RP
>
>



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH v5 0/4] The intro of QEMU block I/O throttling
  2011-08-09  4:17 [Qemu-devel] [PATCH v5 0/4] The intro of QEMU block I/O throttling Zhi Yong Wu
                   ` (3 preceding siblings ...)
  2011-08-09  4:17 ` [Qemu-devel] [PATCH v5 4/4] qmp/hmp: add block_set_io_throttle Zhi Yong Wu
@ 2011-08-09 12:08 ` Stefan Hajnoczi
  2011-08-10  5:09   ` Zhi Yong Wu
  4 siblings, 1 reply; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-08-09 12:08 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: kwolf, pair, stefanha, kvm, mtosatti, qemu-devel, zwu.kernel,
	ryanh, luowenj

On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
>  Makefile.objs     |    2 +-
>  block.c           |  347 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>  block.h           |    6 +-
>  block/blk-queue.c |  141 ++++++++++++++++++++++
>  block/blk-queue.h |   73 +++++++++++
>  block_int.h       |   30 +++++
>  blockdev.c        |  108 +++++++++++++++++
>  blockdev.h        |    2 +
>  hmp-commands.hx   |   15 +++
>  qemu-config.c     |   24 ++++
>  qemu-option.c     |   17 +++
>  qemu-option.h     |    1 +
>  qemu-options.hx   |    1 +
>  qerror.c          |    4 +
>  qerror.h          |    3 +
>  qmp-commands.hx   |   52 ++++++++-
>  16 files changed, 813 insertions(+), 13 deletions(-)
>  create mode 100644 block/blk-queue.c
>  create mode 100644 block/blk-queue.h

This series is against qemu-kvm.git, please send block patches against
qemu.git/master in the future.  There is no qemu-kvm.git-specific code
here so just working against qemu.git and emailing qemu-devel is best.

Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH v5 1/4] block: add the command line support
  2011-08-09  4:17 ` [Qemu-devel] [PATCH v5 1/4] block: add the command line support Zhi Yong Wu
@ 2011-08-09 12:25   ` Stefan Hajnoczi
  2011-08-10  5:20     ` Zhi Yong Wu
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-08-09 12:25 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: kwolf, pair, stefanha, kvm, mtosatti, qemu-devel, zwu.kernel,
	ryanh, luowenj

On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> ---
>  Makefile.objs   |    2 +-
>  blockdev.c      |   39 +++++++++++++++++++++++++++++++++++++++
>  qemu-config.c   |   24 ++++++++++++++++++++++++
>  qemu-option.c   |   17 +++++++++++++++++
>  qemu-option.h   |    1 +
>  qemu-options.hx |    1 +
>  6 files changed, 83 insertions(+), 1 deletions(-)
>
> diff --git a/Makefile.objs b/Makefile.objs
> index 9f99ed4..06f2033 100644
> --- a/Makefile.objs
> +++ b/Makefile.objs
> @@ -23,7 +23,7 @@ block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vv
>  block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o
>  block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
>  block-nested-y += qed-check.o
> -block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
> +block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o blk-queue.o

This does not build:
  LINK  qemu-ga
gcc: error: block/blk-queue.o: No such file or directory

This Makefile.objs change should be in the commit that adds blk-queue.c.

Each patch in a series should compile cleanly and can only depend on
previous patches.  This is important so that git-bisect(1) can be
used; it only works if every commit builds a working program.  It also
makes patch review easier when the patch series builds up logically.

>  block-nested-$(CONFIG_WIN32) += raw-win32.o
>  block-nested-$(CONFIG_POSIX) += raw-posix.o
>  block-nested-$(CONFIG_CURL) += curl.o
> diff --git a/blockdev.c b/blockdev.c
> index c263663..9c78548 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -238,6 +238,10 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
>     int on_read_error, on_write_error;
>     const char *devaddr;
>     DriveInfo *dinfo;
> +    BlockIOLimit io_limits;

This structure is not defined at this point in the patch series.

> +    bool iol_flag = false;
> +    const char *iol_opts[7] = {"bps", "bps_rd", "bps_wr",
> +                                "iops", "iops_rd", "iops_wr"};
>     int is_extboot = 0;
>     int snapshot = 0;
>     int ret;
> @@ -372,6 +376,36 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
>         return NULL;
>     }
>
> +    /* disk io throttling */
> +    iol_flag = qemu_opt_io_limits_enable_flag(opts, iol_opts);
> +    if (iol_flag) {
> +        memset(&io_limits, 0, sizeof(BlockIOLimit));
> +
> +        io_limits.bps[BLOCK_IO_LIMIT_TOTAL]  =
> +                           qemu_opt_get_number(opts, "bps", 0);
> +        io_limits.bps[BLOCK_IO_LIMIT_READ]   =
> +                           qemu_opt_get_number(opts, "bps_rd", 0);
> +        io_limits.bps[BLOCK_IO_LIMIT_WRITE]  =
> +                           qemu_opt_get_number(opts, "bps_wr", 0);
> +        io_limits.iops[BLOCK_IO_LIMIT_TOTAL] =
> +                           qemu_opt_get_number(opts, "iops", 0);
> +        io_limits.iops[BLOCK_IO_LIMIT_READ]  =
> +                           qemu_opt_get_number(opts, "iops_rd", 0);
> +        io_limits.iops[BLOCK_IO_LIMIT_WRITE] =
> +                           qemu_opt_get_number(opts, "iops_wr", 0);
> +
> +        if (((io_limits.bps[BLOCK_IO_LIMIT_TOTAL] != 0)
> +            && ((io_limits.bps[BLOCK_IO_LIMIT_READ] != 0)
> +            || (io_limits.bps[BLOCK_IO_LIMIT_WRITE] != 0)))
> +            || ((io_limits.iops[BLOCK_IO_LIMIT_TOTAL] != 0)
> +            && ((io_limits.iops[BLOCK_IO_LIMIT_READ] != 0)
> +            || (io_limits.iops[BLOCK_IO_LIMIT_WRITE] != 0)))) {
> +            error_report("bps(iops) and bps_rd/bps_wr(iops_rd/iops_wr) \
> +                                        cannot be used at the same time");
> +            return NULL;
> +        }
> +    }
> +
>     on_write_error = BLOCK_ERR_STOP_ENOSPC;
>     if ((buf = qemu_opt_get(opts, "werror")) != NULL) {
>         if (type != IF_IDE && type != IF_SCSI && type != IF_VIRTIO && type != IF_NONE) {
> @@ -483,6 +517,11 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
>
>     bdrv_set_on_error(dinfo->bdrv, on_read_error, on_write_error);
>
> +    /* disk I/O throttling */
> +    if (iol_flag) {
> +        bdrv_set_io_limits(dinfo->bdrv, &io_limits);
> +    }

iol_flag and qemu_opt_io_limits_enable_flag() are not necessary.  If
no limits were set then all fields will be 0 (unlimited).

Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/4] block: add the block queue support
  2011-08-09  4:17 ` [Qemu-devel] [PATCH v5 2/4] block: add the block queue support Zhi Yong Wu
  2011-08-09  8:46   ` Ram Pai
@ 2011-08-09 12:49   ` Stefan Hajnoczi
  2011-08-10  5:54     ` Zhi Yong Wu
  2011-08-12  8:10     ` Zhi Yong Wu
  1 sibling, 2 replies; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-08-09 12:49 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: kwolf, pair, stefanha, kvm, mtosatti, qemu-devel, zwu.kernel,
	ryanh, luowenj

On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
> +/* The APIs for block request queue on qemu block layer.
> + */
> +
> +static void qemu_block_queue_cancel(BlockDriverAIOCB *acb)
> +{
> +    qemu_aio_release(acb);
> +}
> +
> +static AIOPool block_queue_pool = {
> +    .aiocb_size         = sizeof(struct BlockDriverAIOCB),
> +    .cancel             = qemu_block_queue_cancel,
> +};

The lifecycle of block_queue_pool acbs should be like this:

When a request is queued we need a BlockQueue acb because we have no
real acb yet.  So we get an acb from block_queue_pool.

If the acb is cancelled before being dispatched we need to release the
acb and remove the request from the queue.  (This patch does not
remove the request from the queue on cancel.)

When the acb is dispatched we need to keep track of the real acb (e.g.
from qcow2).  The caller will only ever see our acb because there is
no way to give them the pointer to the new acb after dispatch.  That
means we need to keep our acb alive for the entire lifetime of the
request.  (This patch currently releases our acb when the request is
dispatched.)

If the acb is cancelled after being dispatched we need to first cancel
the real acb and then release our acb.

When the acb is dispatched we need to pass qemu_block_queue_callback
as the cb function to handler.  Inside qemu_block_queue_callback we
invoke the request cb and then release our acb.  This is how our acb
should be freed.

> +BlockDriverAIOCB *qemu_block_queue_enqueue(BlockQueue *queue,
> +                        BlockDriverState *bs,
> +                        BlockRequestHandler *handler,
> +                        int64_t sector_num,
> +                        QEMUIOVector *qiov,
> +                        int nb_sectors,
> +                        BlockDriverCompletionFunc *cb,
> +                        void *opaque)
> +{
> +    BlockIORequest *request;
> +    BlockDriverAIOCB *acb;
> +
> +    request = qemu_malloc(sizeof(BlockIORequest));
> +    request->bs = bs;
> +    request->handler = handler;
> +    request->sector_num = sector_num;
> +    request->qiov = qiov;
> +    request->nb_sectors = nb_sectors;
> +    request->cb = cb;
> +    request->opaque = opaque;
> +
> +    QTAILQ_INSERT_TAIL(&queue->requests, request, entry);

It would be simpler to define BlockQueueAIOCB and use it as our acb
instead of managing an extra BlockIORequest structure.  That way you
don't need to worry about extra mallocs and frees.

> +
> +    acb = qemu_aio_get(&block_queue_pool, bs,
> +                       qemu_block_queue_callback, opaque);
> +
> +    request->acb = acb;
> +
> +    return acb;
> +}
> +
> +int qemu_block_queue_handler(BlockIORequest *request)

This function can be static.

Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH v5 3/4] block: add block timer and block throttling algorithm
  2011-08-09  4:17 ` [Qemu-devel] [PATCH v5 3/4] block: add block timer and block throttling algorithm Zhi Yong Wu
  2011-08-09  8:57   ` Ram Pai
@ 2011-08-09 15:19   ` Stefan Hajnoczi
  2011-08-10  6:57     ` Zhi Yong Wu
  1 sibling, 1 reply; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-08-09 15:19 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: kwolf, pair, stefanha, kvm, mtosatti, qemu-devel, zwu.kernel,
	ryanh, luowenj

On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
> Note:
>      1.) When bps/iops limits are specified to a small value such as 511 bytes/s, this VM will hang up. We are considering how to handle this scenario.

If an I/O request is larger than the limit itself then I think we
should let it through with a warning that the limit is too low.  This
reflects the fact that we don't split I/O requests into smaller
requests and the guest may give us 128 KB (or larger?) requests.  In
practice the lowest feasible limit is probably 256 KB or 512 KB.

> diff --git a/block.c b/block.c
> index 24a25d5..8fd6643 100644
> --- a/block.c
> +++ b/block.c
> @@ -29,6 +29,9 @@
>  #include "module.h"
>  #include "qemu-objects.h"
>
> +#include "qemu-timer.h"
> +#include "block/blk-queue.h"
> +
>  #ifdef CONFIG_BSD
>  #include <sys/types.h>
>  #include <sys/stat.h>
> @@ -58,6 +61,13 @@ static int bdrv_read_em(BlockDriverState *bs, int64_t sector_num,
>  static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num,
>                          const uint8_t *buf, int nb_sectors);
>
> +static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
> +        bool is_write, double elapsed_time, uint64_t *wait);
> +static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
> +        double elapsed_time, uint64_t *wait);
> +static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
> +        bool is_write, uint64_t *wait);
> +
>  static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
>     QTAILQ_HEAD_INITIALIZER(bdrv_states);
>
> @@ -90,6 +100,68 @@ int is_windows_drive(const char *filename)
>  }
>  #endif
>
> +/* throttling disk I/O limits */
> +void bdrv_io_limits_disable(BlockDriverState *bs)
> +{
> +    bs->io_limits_enabled = false;
> +    bs->req_from_queue    = false;
> +
> +    if (bs->block_queue) {
> +        qemu_block_queue_flush(bs->block_queue);
> +        qemu_del_block_queue(bs->block_queue);

When you fix the acb lifecycle in block-queue.c this will no longer be
safe.  Requests that are dispatched still have acbs that belong to
this queue.  It is simplest to keep the queue for the lifetime of the
BlockDriverState - it's just a list so it doesn't take up much memory.

> +    }
> +
> +    if (bs->block_timer) {
> +        qemu_del_timer(bs->block_timer);
> +        qemu_free_timer(bs->block_timer);

To prevent double frees:

bs->block_timer = NULL;
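The full guarded teardown might look like this (a sketch, with free() standing in for the QEMU timer calls):

```c
#include <stdlib.h>

typedef struct QEMUTimer { int pending; } QEMUTimer;   /* stand-in type */

/* Sketch of the double-free guard: free the timer once and clear the
 * pointer, so a second pass through the disable path is a no-op.
 * free() stands in for qemu_del_timer() + qemu_free_timer(). */
static void block_timer_teardown(QEMUTimer **timer)
{
    if (*timer) {
        free(*timer);
        *timer = NULL;   /* prevents the double free */
    }
}
```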

> +    }
> +
> +    bs->slice_start[0]   = 0;
> +    bs->slice_start[1]   = 0;
> +
> +    bs->slice_end[0]     = 0;
> +    bs->slice_end[1]     = 0;
> +}
> +
> +static void bdrv_block_timer(void *opaque)
> +{
> +    BlockDriverState *bs = opaque;
> +    BlockQueue *queue = bs->block_queue;
> +
> +    qemu_block_queue_flush(queue);
> +}
> +
> +void bdrv_io_limits_enable(BlockDriverState *bs)
> +{
> +    bs->req_from_queue = false;
> +
> +    bs->block_queue    = qemu_new_block_queue();
> +    bs->block_timer    = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
> +
> +    bs->slice_start[BLOCK_IO_LIMIT_READ]  = qemu_get_clock_ns(vm_clock);
> +    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = qemu_get_clock_ns(vm_clock);
> +
> +    bs->slice_end[BLOCK_IO_LIMIT_READ]    =
> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
> +    bs->slice_end[BLOCK_IO_LIMIT_WRITE]   =
> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;

The slice times could just be initialized to 0 here.  When
bdrv_exceed_io_limits() is called it will start a new slice.

> +}
> +
> +bool bdrv_io_limits_enabled(BlockDriverState *bs)
> +{
> +    BlockIOLimit *io_limits = &bs->io_limits;
> +    if ((io_limits->bps[BLOCK_IO_LIMIT_READ] == 0)
> +         && (io_limits->bps[BLOCK_IO_LIMIT_WRITE] == 0)
> +         && (io_limits->bps[BLOCK_IO_LIMIT_TOTAL] == 0)
> +         && (io_limits->iops[BLOCK_IO_LIMIT_READ] == 0)
> +         && (io_limits->iops[BLOCK_IO_LIMIT_WRITE] == 0)
> +         && (io_limits->iops[BLOCK_IO_LIMIT_TOTAL] == 0)) {
> +        return false;
> +    }
> +
> +    return true;
> +}
> +
>  /* check if the path starts with "<protocol>:" */
>  static int path_has_protocol(const char *path)
>  {
> @@ -642,6 +714,11 @@ int bdrv_open(BlockDriverState *bs, const char *filename, int flags,
>             bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
>     }
>
> +    /* throttling disk I/O limits */
> +    if (bs->io_limits_enabled) {
> +        bdrv_io_limits_enable(bs);
> +    }
> +
>     return 0;
>
>  unlink_and_fail:
> @@ -680,6 +757,16 @@ void bdrv_close(BlockDriverState *bs)
>         if (bs->change_cb)
>             bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
>     }
> +
> +    /* throttling disk I/O limits */
> +    if (bs->block_queue) {
> +        qemu_del_block_queue(bs->block_queue);
> +    }
> +
> +    if (bs->block_timer) {
> +        qemu_del_timer(bs->block_timer);
> +        qemu_free_timer(bs->block_timer);
> +    }
>  }
>
>  void bdrv_close_all(void)
> @@ -1312,6 +1399,15 @@ void bdrv_get_geometry_hint(BlockDriverState *bs,
>     *psecs = bs->secs;
>  }
>
> +/* throttling disk io limits */
> +void bdrv_set_io_limits(BlockDriverState *bs,
> +                            BlockIOLimit *io_limits)
> +{
> +    memset(&bs->io_limits, 0, sizeof(BlockIOLimit));

The assignment from *io_limits overwrites all of bs->io_limits, the
memset() can be removed.

> +    bs->io_limits = *io_limits;
> +    bs->io_limits_enabled = bdrv_io_limits_enabled(bs);
> +}
> +
>  /* Recognize floppy formats */
>  typedef struct FDFormat {
>     FDriveType drive;
> @@ -1707,6 +1803,16 @@ static void bdrv_print_dict(QObject *obj, void *opaque)
>                             qdict_get_bool(qdict, "ro"),
>                             qdict_get_str(qdict, "drv"),
>                             qdict_get_bool(qdict, "encrypted"));
> +
> +        monitor_printf(mon, " bps=%" PRId64 " bps_rd=%" PRId64
> +                            " bps_wr=%" PRId64 " iops=%" PRId64
> +                            " iops_rd=%" PRId64 " iops_wr=%" PRId64,
> +                            qdict_get_int(qdict, "bps"),
> +                            qdict_get_int(qdict, "bps_rd"),
> +                            qdict_get_int(qdict, "bps_wr"),
> +                            qdict_get_int(qdict, "iops"),
> +                            qdict_get_int(qdict, "iops_rd"),
> +                            qdict_get_int(qdict, "iops_wr"));
>     } else {
>         monitor_printf(mon, " [not inserted]");
>     }
> @@ -1739,10 +1845,22 @@ void bdrv_info(Monitor *mon, QObject **ret_data)
>             QDict *bs_dict = qobject_to_qdict(bs_obj);
>
>             obj = qobject_from_jsonf("{ 'file': %s, 'ro': %i, 'drv': %s, "
> -                                     "'encrypted': %i }",
> +                                     "'encrypted': %i, "
> +                                     "'bps': %" PRId64 ","
> +                                     "'bps_rd': %" PRId64 ","
> +                                     "'bps_wr': %" PRId64 ","
> +                                     "'iops': %" PRId64 ","
> +                                     "'iops_rd': %" PRId64 ","
> +                                     "'iops_wr': %" PRId64 "}",
>                                      bs->filename, bs->read_only,
>                                      bs->drv->format_name,
> -                                     bdrv_is_encrypted(bs));
> +                                     bdrv_is_encrypted(bs),
> +                                     bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL],
> +                                     bs->io_limits.bps[BLOCK_IO_LIMIT_READ],
> +                                     bs->io_limits.bps[BLOCK_IO_LIMIT_WRITE],
> +                                     bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL],
> +                                     bs->io_limits.iops[BLOCK_IO_LIMIT_READ],
> +                                     bs->io_limits.iops[BLOCK_IO_LIMIT_WRITE]);
>             if (bs->backing_file[0] != '\0') {
>                 QDict *qdict = qobject_to_qdict(obj);
>                 qdict_put(qdict, "backing_file",
> @@ -2111,6 +2229,165 @@ char *bdrv_snapshot_dump(char *buf, int buf_size, QEMUSnapshotInfo *sn)
>     return buf;
>  }
>
> +static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
> +                 bool is_write, double elapsed_time, uint64_t *wait) {
> +    uint64_t bps_limit = 0;
> +    double   bytes_limit, bytes_disp, bytes_res;
> +    double   slice_time, wait_time;
> +
> +    if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
> +        bps_limit = bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL];
> +    } else if (bs->io_limits.bps[is_write]) {
> +        bps_limit = bs->io_limits.bps[is_write];
> +    } else {
> +        if (wait) {
> +            *wait = 0;
> +        }
> +
> +        return false;
> +    }
> +
> +    slice_time = bs->slice_end[is_write] - bs->slice_start[is_write];
> +    slice_time /= (BLOCK_IO_SLICE_TIME * 10.0);
> +    bytes_limit = bps_limit * slice_time;
> +    bytes_disp  = bs->io_disps.bytes[is_write];
> +    if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
> +        bytes_disp += bs->io_disps.bytes[!is_write];
> +    }
> +
> +    bytes_res   = (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
> +
> +    if (bytes_disp + bytes_res <= bytes_limit) {
> +        if (wait) {
> +            *wait = 0;
> +        }
> +
> +        return false;
> +    }
> +
> +    /* Calc approx time to dispatch */
> +    wait_time = (bytes_disp + bytes_res) / bps_limit - elapsed_time;
> +
> +    if (wait) {
> +        *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
> +    }
> +
> +    return true;
> +}
> +
> +static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
> +                             double elapsed_time, uint64_t *wait) {
> +    uint64_t iops_limit = 0;
> +    double   ios_limit, ios_disp;
> +    double   slice_time, wait_time;
> +
> +    if (bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
> +        iops_limit = bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL];
> +    } else if (bs->io_limits.iops[is_write]) {
> +        iops_limit = bs->io_limits.iops[is_write];
> +    } else {
> +        if (wait) {
> +            *wait = 0;
> +        }
> +
> +        return false;
> +    }
> +
> +    slice_time = bs->slice_end[is_write] - bs->slice_start[is_write];
> +    slice_time /= (BLOCK_IO_SLICE_TIME * 10.0);
> +    ios_limit  = iops_limit * slice_time;
> +    ios_disp   = bs->io_disps.ios[is_write];
> +    if (bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
> +        ios_disp += bs->io_disps.ios[!is_write];
> +    }
> +
> +    if (ios_disp + 1 <= ios_limit) {
> +        if (wait) {
> +            *wait = 0;
> +        }
> +
> +        return false;
> +    }
> +
> +    /* Calc approx time to dispatch */
> +    wait_time = (ios_disp + 1) / iops_limit;
> +    if (wait_time > elapsed_time) {
> +        wait_time = wait_time - elapsed_time;
> +    } else {
> +        wait_time = 0;
> +    }
> +
> +    if (wait) {
> +        *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
> +    }
> +
> +    return true;
> +}
> +
> +static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
> +                           bool is_write, uint64_t *wait) {
> +    int64_t  real_time, real_slice;
> +    uint64_t bps_wait = 0, iops_wait = 0, max_wait;
> +    double   elapsed_time;
> +    int      bps_ret, iops_ret;
> +
> +    real_time = qemu_get_clock_ns(vm_clock);

Please call it 'now' instead of 'real_time'.  There is a "real-time
clock" called rt_clock and this variable name could be confusing.

> +    real_slice = bs->slice_end[is_write] - bs->slice_start[is_write];
> +    if ((bs->slice_start[is_write] < real_time)
> +        && (bs->slice_end[is_write] > real_time)) {
> +        bs->slice_end[is_write]   = real_time + BLOCK_IO_SLICE_TIME;
> +    } else {
> +        bs->slice_start[is_write]     = real_time;
> +        bs->slice_end[is_write]       = real_time + BLOCK_IO_SLICE_TIME;
> +
> +        bs->io_disps.bytes[is_write]  = 0;
> +        bs->io_disps.bytes[!is_write] = 0;
> +
> +        bs->io_disps.ios[is_write]    = 0;
> +        bs->io_disps.ios[!is_write]   = 0;
> +    }
> +
> +    /* If a limit was exceeded, immediately queue this request */
> +    if (!bs->req_from_queue
> +        && !QTAILQ_EMPTY(&bs->block_queue->requests)) {

bs->req_from_queue seems to be a way to prevent requests from being
unconditionally queued during a flush.  Please define a BlockQueue
interface instead, something like this:

/* If a limit was exceeded, immediately queue this request */
if (qemu_block_queue_has_pending(bs->block_queue)) {

Then req_from_queue can be hidden inside BlockQueue.  I'd rename it to
bool flushing.  qemu_block_queue_flush() will set it to true at the
start of the function and false at the end of the function.
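A sketch of that interface (the request list reduced to a counter for brevity):

```c
#include <stdbool.h>

/* Hides the per-BlockDriverState req_from_queue flag inside the queue
 * as "flushing": has_pending() answers false while the queue is busy
 * dispatching its own requests, so they are not re-queued. */
typedef struct BlockQueue {
    int  nr_requests;   /* stand-in for the QTAILQ of queued requests */
    bool flushing;      /* true only inside qemu_block_queue_flush() */
} BlockQueue;

static bool qemu_block_queue_has_pending(BlockQueue *queue)
{
    return !queue->flushing && queue->nr_requests > 0;
}

static void qemu_block_queue_flush(BlockQueue *queue)
{
    queue->flushing = true;
    while (queue->nr_requests > 0) {
        queue->nr_requests--;   /* dispatch one queued request */
    }
    queue->flushing = false;
}
```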

> +        if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]
> +            || bs->io_limits.bps[is_write] || bs->io_limits.iops[is_write]
> +            || bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
> +            if (wait) {
> +                *wait = 0;

No estimated wait time?  That's okay only if we stop setting the timer
whenever a request is queued - more comments about this below.

> +            }
> +
> +            return true;
> +        }
> +    }
> +
> +    elapsed_time  = real_time - bs->slice_start[is_write];
> +    elapsed_time  /= (BLOCK_IO_SLICE_TIME * 10.0);

Why * 10.0?
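Presumably the factor comes from BLOCK_IO_SLICE_TIME * 10 = 10^9, i.e. the division converts nanoseconds to seconds (the slice itself being 100 ms); a quick sanity check of that reading:

```c
#include <stdint.h>

#define BLOCK_IO_SLICE_TIME 100000000LL   /* 100 ms, in nanoseconds */

/* elapsed_ns / (BLOCK_IO_SLICE_TIME * 10.0) == elapsed_ns / 1e9,
 * so elapsed_time ends up expressed in (fractional) seconds. */
static double elapsed_seconds(int64_t elapsed_ns)
{
    return elapsed_ns / (BLOCK_IO_SLICE_TIME * 10.0);
}
```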

> +
> +    bps_ret  = bdrv_exceed_bps_limits(bs, nb_sectors,
> +                                      is_write, elapsed_time, &bps_wait);
> +    iops_ret = bdrv_exceed_iops_limits(bs, is_write,
> +                                      elapsed_time, &iops_wait);
> +    if (bps_ret || iops_ret) {
> +        max_wait = bps_wait > iops_wait ? bps_wait : iops_wait;
> +        if (wait) {
> +            *wait = max_wait;
> +        }
> +
> +        real_time = qemu_get_clock_ns(vm_clock);
> +        if (bs->slice_end[is_write] < real_time + max_wait) {
> +            bs->slice_end[is_write] = real_time + max_wait;
> +        }

Why is this necessary?  We have already extended the slice at the top
of the function.

> +
> +        return true;
> +    }
> +
> +    if (wait) {
> +        *wait = 0;
> +    }
> +
> +    return false;
> +}
>
>  /**************************************************************/
>  /* async I/Os */
> @@ -2121,13 +2398,28 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
>  {
>     BlockDriver *drv = bs->drv;
>     BlockDriverAIOCB *ret;
> +    uint64_t wait_time = 0;
>
>     trace_bdrv_aio_readv(bs, sector_num, nb_sectors, opaque);
>
> -    if (!drv)
> -        return NULL;
> -    if (bdrv_check_request(bs, sector_num, nb_sectors))
> +    if (!drv || bdrv_check_request(bs, sector_num, nb_sectors)) {
> +        if (bs->io_limits_enabled) {
> +            bs->req_from_queue = false;
> +        }
>         return NULL;
> +    }
> +
> +    /* throttling disk read I/O */
> +    if (bs->io_limits_enabled) {
> +        if (bdrv_exceed_io_limits(bs, nb_sectors, false, &wait_time)) {
> +            ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_readv,
> +                              sector_num, qiov, nb_sectors, cb, opaque);
> +            qemu_mod_timer(bs->block_timer,
> +                       wait_time + qemu_get_clock_ns(vm_clock));

If a guest keeps sending I/O requests, say every millisecond, then we
will delay dispatch forever.  In practice we hope the storage
interface (virtio-blk, LSI SCSI, etc) has a resource limit that
prevents this.  But the point remains that this delays requests that
should be dispatched for too long.  Once the timer has been set we
should not set it again.
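The "set it only once" behaviour could be sketched like this (a plain struct standing in for QEMUTimer):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool    pending;    /* stand-in for a qemu_timer_pending() check */
    int64_t deadline;
} Timer;

/* Re-arm the throttling timer only if it is not already pending, so a
 * steady stream of guest requests cannot push the deadline out forever. */
static void timer_arm_once(Timer *t, int64_t now, uint64_t wait)
{
    if (!t->pending) {
        t->deadline = now + (int64_t)wait;
        t->pending  = true;
    }
}
```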

> +            bs->req_from_queue = false;
> +            return ret;
> +        }
> +    }
>
>     ret = drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors,
>                               cb, opaque);
> @@ -2136,6 +2428,16 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
>        /* Update stats even though technically transfer has not happened. */
>        bs->rd_bytes += (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
>        bs->rd_ops ++;
> +
> +        if (bs->io_limits_enabled) {
> +            bs->io_disps.bytes[BLOCK_IO_LIMIT_READ] +=
> +                              (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
> +            bs->io_disps.ios[BLOCK_IO_LIMIT_READ]++;
> +        }
> +    }
> +
> +    if (bs->io_limits_enabled) {
> +        bs->req_from_queue = false;
>     }
>
>     return ret;
> @@ -2184,15 +2486,18 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
>     BlockDriver *drv = bs->drv;
>     BlockDriverAIOCB *ret;
>     BlockCompleteData *blk_cb_data;
> +    uint64_t wait_time = 0;
>
>     trace_bdrv_aio_writev(bs, sector_num, nb_sectors, opaque);
>
> -    if (!drv)
> -        return NULL;
> -    if (bs->read_only)
> -        return NULL;
> -    if (bdrv_check_request(bs, sector_num, nb_sectors))
> +    if (!drv || bs->read_only
> +        || bdrv_check_request(bs, sector_num, nb_sectors)) {
> +        if (bs->io_limits_enabled) {
> +            bs->req_from_queue = false;
> +        }
> +
>         return NULL;
> +    }
>
>     if (bs->dirty_bitmap) {
>         blk_cb_data = blk_dirty_cb_alloc(bs, sector_num, nb_sectors, cb,
> @@ -2201,6 +2506,18 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
>         opaque = blk_cb_data;
>     }
>
> +    /* throttling disk write I/O */
> +    if (bs->io_limits_enabled) {
> +        if (bdrv_exceed_io_limits(bs, nb_sectors, true, &wait_time)) {
> +            ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_writev,
> +                                  sector_num, qiov, nb_sectors, cb, opaque);
> +            qemu_mod_timer(bs->block_timer,
> +                                  wait_time + qemu_get_clock_ns(vm_clock));
> +            bs->req_from_queue = false;
> +            return ret;
> +        }
> +    }
> +
>     ret = drv->bdrv_aio_writev(bs, sector_num, qiov, nb_sectors,
>                                cb, opaque);
>
> @@ -2211,6 +2528,16 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
>         if (bs->wr_highest_sector < sector_num + nb_sectors - 1) {
>             bs->wr_highest_sector = sector_num + nb_sectors - 1;
>         }
> +
> +        if (bs->io_limits_enabled) {
> +            bs->io_disps.bytes[BLOCK_IO_LIMIT_WRITE] +=
> +                               (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
> +            bs->io_disps.ios[BLOCK_IO_LIMIT_WRITE]++;
> +        }
> +    }
> +
> +    if (bs->io_limits_enabled) {
> +        bs->req_from_queue = false;
>     }
>
>     return ret;
> diff --git a/block.h b/block.h
> index 859d1d9..3d02902 100644
> --- a/block.h
> +++ b/block.h
> @@ -57,6 +57,11 @@ void bdrv_info(Monitor *mon, QObject **ret_data);
>  void bdrv_stats_print(Monitor *mon, const QObject *data);
>  void bdrv_info_stats(Monitor *mon, QObject **ret_data);
>
> +/* disk I/O throttling */
> +void bdrv_io_limits_enable(BlockDriverState *bs);
> +void bdrv_io_limits_disable(BlockDriverState *bs);
> +bool bdrv_io_limits_enabled(BlockDriverState *bs);
> +
>  void bdrv_init(void);
>  void bdrv_init_with_whitelist(void);
>  BlockDriver *bdrv_find_protocol(const char *filename);
> @@ -97,7 +102,6 @@ int bdrv_change_backing_file(BlockDriverState *bs,
>     const char *backing_file, const char *backing_fmt);
>  void bdrv_register(BlockDriver *bdrv);
>
> -
>  typedef struct BdrvCheckResult {
>     int corruptions;
>     int leaks;
> diff --git a/block_int.h b/block_int.h
> index 1e265d2..a4cd458 100644
> --- a/block_int.h
> +++ b/block_int.h
> @@ -27,10 +27,17 @@
>  #include "block.h"
>  #include "qemu-option.h"
>  #include "qemu-queue.h"
> +#include "block/blk-queue.h"
>
>  #define BLOCK_FLAG_ENCRYPT     1
>  #define BLOCK_FLAG_COMPAT6     4
>
> +#define BLOCK_IO_LIMIT_READ     0
> +#define BLOCK_IO_LIMIT_WRITE    1
> +#define BLOCK_IO_LIMIT_TOTAL    2
> +
> +#define BLOCK_IO_SLICE_TIME     100000000

Please add a comment indicating the units: /* nanoseconds */

> +
>  #define BLOCK_OPT_SIZE          "size"
>  #define BLOCK_OPT_ENCRYPT       "encryption"
>  #define BLOCK_OPT_COMPAT6       "compat6"
> @@ -46,6 +53,16 @@ typedef struct AIOPool {
>     BlockDriverAIOCB *free_aiocb;
>  } AIOPool;
>
> +typedef struct BlockIOLimit {
> +    uint64_t bps[3];
> +    uint64_t iops[3];
> +} BlockIOLimit;
> +
> +typedef struct BlockIODisp {
> +    uint64_t bytes[2];
> +    uint64_t ios[2];
> +} BlockIODisp;
> +
>  struct BlockDriver {
>     const char *format_name;
>     int instance_size;
> @@ -175,6 +192,16 @@ struct BlockDriverState {
>
>     void *sync_aiocb;
>
> +    /* the time for latest disk I/O */
> +    int64_t slice_start[2];
> +    int64_t slice_end[2];
> +    BlockIOLimit io_limits;
> +    BlockIODisp  io_disps;
> +    BlockQueue   *block_queue;
> +    QEMUTimer    *block_timer;
> +    bool         io_limits_enabled;
> +    bool         req_from_queue;
> +
>     /* I/O stats (display with "info blockstats"). */
>     uint64_t rd_bytes;
>     uint64_t wr_bytes;
> @@ -222,6 +249,9 @@ void qemu_aio_release(void *p);
>
>  void *qemu_blockalign(BlockDriverState *bs, size_t size);
>
> +void bdrv_set_io_limits(BlockDriverState *bs,
> +                            BlockIOLimit *io_limits);
> +
>  #ifdef _WIN32
>  int is_windows_drive(const char *filename);
>  #endif
> --
> 1.7.2.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


* Re: [Qemu-devel] [PATCH v5 0/4] The intro of QEMU block I/O throttling
  2011-08-09 12:08 ` [Qemu-devel] [PATCH v5 0/4] The intro of QEMU block I/O throttling Stefan Hajnoczi
@ 2011-08-10  5:09   ` Zhi Yong Wu
  2011-08-10  9:39     ` Stefan Hajnoczi
  0 siblings, 1 reply; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-10  5:09 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kwolf, pair, stefanha, kvm, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Tue, Aug 9, 2011 at 8:08 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
>>  Makefile.objs     |    2 +-
>>  block.c           |  347 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>>  block.h           |    6 +-
>>  block/blk-queue.c |  141 ++++++++++++++++++++++
>>  block/blk-queue.h |   73 +++++++++++
>>  block_int.h       |   30 +++++
>>  blockdev.c        |  108 +++++++++++++++++
>>  blockdev.h        |    2 +
>>  hmp-commands.hx   |   15 +++
>>  qemu-config.c     |   24 ++++
>>  qemu-option.c     |   17 +++
>>  qemu-option.h     |    1 +
>>  qemu-options.hx   |    1 +
>>  qerror.c          |    4 +
>>  qerror.h          |    3 +
>>  qmp-commands.hx   |   52 ++++++++-
>>  16 files changed, 813 insertions(+), 13 deletions(-)
>>  create mode 100644 block/blk-queue.c
>>  create mode 100644 block/blk-queue.h
>
> This series is against qemu-kvm.git, please send block patches against
> qemu.git/master in the future.  There is no qemu-kvm.git-specific code
> here so just working against qemu.git and emailing qemu-devel is best.
Do you know how to rebase it from qemu-kvm.git to qemu.git directly?

>
> Stefan
>



-- 
Regards,

Zhi Yong Wu


* Re: [Qemu-devel] [PATCH v5 1/4] block: add the command line support
  2011-08-09 12:25   ` Stefan Hajnoczi
@ 2011-08-10  5:20     ` Zhi Yong Wu
  2011-08-10  9:27       ` Stefan Hajnoczi
  0 siblings, 1 reply; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-10  5:20 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kwolf, pair, stefanha, kvm, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Tue, Aug 9, 2011 at 8:25 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
>> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>> ---
>>  Makefile.objs   |    2 +-
>>  blockdev.c      |   39 +++++++++++++++++++++++++++++++++++++++
>>  qemu-config.c   |   24 ++++++++++++++++++++++++
>>  qemu-option.c   |   17 +++++++++++++++++
>>  qemu-option.h   |    1 +
>>  qemu-options.hx |    1 +
>>  6 files changed, 83 insertions(+), 1 deletions(-)
>>
>> diff --git a/Makefile.objs b/Makefile.objs
>> index 9f99ed4..06f2033 100644
>> --- a/Makefile.objs
>> +++ b/Makefile.objs
>> @@ -23,7 +23,7 @@ block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vv
>>  block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o
>>  block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
>>  block-nested-y += qed-check.o
>> -block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
>> +block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o blk-queue.o
>
> This does not build:
>  LINK  qemu-ga
> gcc: error: block/blk-queue.o: No such file or directory
>
> This Makefile.objs change should be in the commit that adds blk-queue.c.
>
> Each patch in a series should compile cleanly and can only depend on
> previous patches.  This is important so that git-bisect(1) can be
> used, it only works if every commit builds a working program.  It also
> makes patch review easier when the patch series builds up logically.
It seems it will take quite a bit of time to strictly stage the
hunks into each corresponding patch. :)
OK, I will.
>
>>  block-nested-$(CONFIG_WIN32) += raw-win32.o
>>  block-nested-$(CONFIG_POSIX) += raw-posix.o
>>  block-nested-$(CONFIG_CURL) += curl.o
>> diff --git a/blockdev.c b/blockdev.c
>> index c263663..9c78548 100644
>> --- a/blockdev.c
>> +++ b/blockdev.c
>> @@ -238,6 +238,10 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
>>     int on_read_error, on_write_error;
>>     const char *devaddr;
>>     DriveInfo *dinfo;
>> +    BlockIOLimit io_limits;
>
> This structure is not defined at this point in the patch series.
The hunk containing this structure will be staged in this patch.

>
>> +    bool iol_flag = false;
>> +    const char *iol_opts[7] = {"bps", "bps_rd", "bps_wr",
>> +                                "iops", "iops_rd", "iops_wr"};
>>     int is_extboot = 0;
>>     int snapshot = 0;
>>     int ret;
>> @@ -372,6 +376,36 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
>>         return NULL;
>>     }
>>
>> +    /* disk io throttling */
>> +    iol_flag = qemu_opt_io_limits_enable_flag(opts, iol_opts);
>> +    if (iol_flag) {
>> +        memset(&io_limits, 0, sizeof(BlockIOLimit));
>> +
>> +        io_limits.bps[BLOCK_IO_LIMIT_TOTAL]  =
>> +                           qemu_opt_get_number(opts, "bps", 0);
>> +        io_limits.bps[BLOCK_IO_LIMIT_READ]   =
>> +                           qemu_opt_get_number(opts, "bps_rd", 0);
>> +        io_limits.bps[BLOCK_IO_LIMIT_WRITE]  =
>> +                           qemu_opt_get_number(opts, "bps_wr", 0);
>> +        io_limits.iops[BLOCK_IO_LIMIT_TOTAL] =
>> +                           qemu_opt_get_number(opts, "iops", 0);
>> +        io_limits.iops[BLOCK_IO_LIMIT_READ]  =
>> +                           qemu_opt_get_number(opts, "iops_rd", 0);
>> +        io_limits.iops[BLOCK_IO_LIMIT_WRITE] =
>> +                           qemu_opt_get_number(opts, "iops_wr", 0);
>> +
>> +        if (((io_limits.bps[BLOCK_IO_LIMIT_TOTAL] != 0)
>> +            && ((io_limits.bps[BLOCK_IO_LIMIT_READ] != 0)
>> +            || (io_limits.bps[BLOCK_IO_LIMIT_WRITE] != 0)))
>> +            || ((io_limits.iops[BLOCK_IO_LIMIT_TOTAL] != 0)
>> +            && ((io_limits.iops[BLOCK_IO_LIMIT_READ] != 0)
>> +            || (io_limits.iops[BLOCK_IO_LIMIT_WRITE] != 0)))) {
>> +            error_report("bps(iops) and bps_rd/bps_wr(iops_rd/iops_wr) \
>> +                                        cannot be used at the same time");
>> +            return NULL;
>> +        }
>> +    }
>> +
>>     on_write_error = BLOCK_ERR_STOP_ENOSPC;
>>     if ((buf = qemu_opt_get(opts, "werror")) != NULL) {
>>         if (type != IF_IDE && type != IF_SCSI && type != IF_VIRTIO && type != IF_NONE) {
>> @@ -483,6 +517,11 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
>>
>>     bdrv_set_on_error(dinfo->bdrv, on_read_error, on_write_error);
>>
>> +    /* disk I/O throttling */
>> +    if (iol_flag) {
>> +        bdrv_set_io_limits(dinfo->bdrv, &io_limits);
>> +    }
>
> iol_flag and qemu_opt_io_limits_enable_flag() are not necessary.  If
> no limits were set then all fields will be 0 (unlimited).
Why are they not necessary here? qemu_opt_io_limits_enable_flag() is
used to determine whether io_limits is enabled.
If it is, iol_flag will be set to true. So I think they are necessary here.
>
> Stefan
>



-- 
Regards,

Zhi Yong Wu


* Re: [Qemu-devel] [PATCH v5 2/4] block: add the block queue support
  2011-08-09 12:49   ` Stefan Hajnoczi
@ 2011-08-10  5:54     ` Zhi Yong Wu
  2011-08-10  9:37       ` Stefan Hajnoczi
  2011-08-12  8:10     ` Zhi Yong Wu
  1 sibling, 1 reply; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-10  5:54 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kwolf, pair, stefanha, kvm, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Tue, Aug 9, 2011 at 8:49 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
>> +/* The APIs for block request queue on qemu block layer.
>> + */
>> +
>> +static void qemu_block_queue_cancel(BlockDriverAIOCB *acb)
>> +{
>> +    qemu_aio_release(acb);
>> +}
>> +
>> +static AIOPool block_queue_pool = {
>> +    .aiocb_size         = sizeof(struct BlockDriverAIOCB),
>> +    .cancel             = qemu_block_queue_cancel,
>> +};
>
> The lifecycle of block_queue_pool acbs should be like this:
>
> When a request is queued we need a BlockQueue acb because we have no
> real acb yet.  So we get an acb from block_queue_pool.
>
> If the acb is cancelled before being dispatched we need to release the
> acb and remove the request from the queue.  (This patch does not
> remove the request from the queue on cancel.)
>
> When the acb is dispatched we need to keep track of the real acb (e.g.
> from qcow2).  The caller will only ever see our acb because there is
> no way to give them the pointer to the new acb after dispatch.  That
> means we need to keep the our acb alive for the entire lifetime of the
> request.  (This patch currently releases our acb when the request is
> dispatched.)
>
> If the acb is cancelled after being dispatched we need to first cancel
> the real acb and then release our acb.
>
> When the acb is dispatched we need to pass qemu_block_queue_callback
> as the cb function to handler.  Inside qemu_block_queue_callback we
> invoke the request cb and then release our acb.  This is how our acb
> should be freed.
To the point, very good.
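The lifecycle described above might be sketched like this (simplified types; release is modelled with a flag instead of qemu_aio_release()):

```c
/* The queue's acb wraps the caller's cb/opaque.  When the dispatched
 * (real) request completes, the wrapper runs the caller's callback and
 * only then releases the queue acb, so our acb stays alive for the
 * request's entire lifetime. */
typedef void CompletionFunc(void *opaque, int ret);

typedef struct QueueAIOCB {
    CompletionFunc *cb;       /* caller's completion callback */
    void           *opaque;   /* caller's opaque pointer */
    int             released; /* stand-in for qemu_aio_release() */
} QueueAIOCB;

static void qemu_block_queue_callback(void *opaque, int ret)
{
    QueueAIOCB *acb = opaque;

    acb->cb(acb->opaque, ret);  /* complete the caller's request first */
    acb->released = 1;          /* ...then free our acb */
}

/* Example caller callback that just records the return value. */
static void example_cb(void *opaque, int ret)
{
    *(int *)opaque = ret;
}
```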

>
>> +BlockDriverAIOCB *qemu_block_queue_enqueue(BlockQueue *queue,
>> +                        BlockDriverState *bs,
>> +                        BlockRequestHandler *handler,
>> +                        int64_t sector_num,
>> +                        QEMUIOVector *qiov,
>> +                        int nb_sectors,
>> +                        BlockDriverCompletionFunc *cb,
>> +                        void *opaque)
>> +{
>> +    BlockIORequest *request;
>> +    BlockDriverAIOCB *acb;
>> +
>> +    request = qemu_malloc(sizeof(BlockIORequest));
>> +    request->bs = bs;
>> +    request->handler = handler;
>> +    request->sector_num = sector_num;
>> +    request->qiov = qiov;
>> +    request->nb_sectors = nb_sectors;
>> +    request->cb = cb;
>> +    request->opaque = opaque;
>> +
>> +    QTAILQ_INSERT_TAIL(&queue->requests, request, entry);
>
> It would be simpler to define BlockQueueAIOCB and using it as our acb
> instead of managing an extra BlockIORequest structure.  That way you
> don't need to worry about extra mallocs and frees.
Sorry, I don't get what you mean. How would it be defined? Can you elaborate?
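One possible shape — a hypothetical sketch, with stand-in typedefs for the QEMU types so it compiles on its own — might be:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins so the sketch compiles outside QEMU. */
typedef struct BlockDriverState BlockDriverState;
typedef struct QEMUIOVector QEMUIOVector;
typedef void BlockDriverCompletionFunc(void *opaque, int ret);
typedef struct BlockDriverAIOCB {
    BlockDriverState *bs;
    BlockDriverCompletionFunc *cb;
    void *opaque;
} BlockDriverAIOCB;

/* Hypothetical BlockQueueAIOCB: the acb allocated from the pool carries
 * the queued request's parameters itself, so no separate BlockIORequest
 * needs to be malloc'd and freed. */
typedef struct BlockQueueAIOCB {
    BlockDriverAIOCB common;        /* must come first: pool acbs alias it */
    int64_t sector_num;
    QEMUIOVector *qiov;
    int nb_sectors;
    BlockDriverAIOCB *real_acb;     /* the driver's acb once dispatched */
    struct BlockQueueAIOCB *next;   /* simplified QTAILQ_ENTRY stand-in */
} BlockQueueAIOCB;
```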
>
>> +
>> +    acb = qemu_aio_get(&block_queue_pool, bs,
>> +                       qemu_block_queue_callback, opaque);
>> +
>> +    request->acb = acb;
>> +
>> +    return acb;
>> +}
>> +
>> +int qemu_block_queue_handler(BlockIORequest *request)
>
> This function can be static.
Sure.
>
> Stefan
>



-- 
Regards,

Zhi Yong Wu


* Re: [Qemu-devel] [PATCH v5 3/4] block: add block timer and block throttling algorithm
  2011-08-09 15:19   ` Stefan Hajnoczi
@ 2011-08-10  6:57     ` Zhi Yong Wu
  2011-08-10 11:00       ` Stefan Hajnoczi
  0 siblings, 1 reply; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-10  6:57 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kwolf, pair, stefanha, kvm, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Tue, Aug 9, 2011 at 11:19 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
>> Note:
>>      1.) When bps/iops limits are specified to a small value such as 511 bytes/s, this VM will hang up. We are considering how to handle this scenario.
>
> If an I/O request is larger than the limit itself then I think we
> should let it through with a warning that the limit is too low.  This
Printing a warning does not seem like a nice way to handle it; the
throttling function would then have no effect in this scenario.
> reflects the fact that we don't split I/O requests into smaller
> requests and the guest may give us 128 KB (or larger?) requests.  In
> practice the lowest feasible limit is probably 256 KB or 512 KB.
>
>> diff --git a/block.c b/block.c
>> index 24a25d5..8fd6643 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -29,6 +29,9 @@
>>  #include "module.h"
>>  #include "qemu-objects.h"
>>
>> +#include "qemu-timer.h"
>> +#include "block/blk-queue.h"
>> +
>>  #ifdef CONFIG_BSD
>>  #include <sys/types.h>
>>  #include <sys/stat.h>
>> @@ -58,6 +61,13 @@ static int bdrv_read_em(BlockDriverState *bs, int64_t sector_num,
>>  static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num,
>>                          const uint8_t *buf, int nb_sectors);
>>
>> +static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
>> +        bool is_write, double elapsed_time, uint64_t *wait);
>> +static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
>> +        double elapsed_time, uint64_t *wait);
>> +static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
>> +        bool is_write, uint64_t *wait);
>> +
>>  static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
>>     QTAILQ_HEAD_INITIALIZER(bdrv_states);
>>
>> @@ -90,6 +100,68 @@ int is_windows_drive(const char *filename)
>>  }
>>  #endif
>>
>> +/* throttling disk I/O limits */
>> +void bdrv_io_limits_disable(BlockDriverState *bs)
>> +{
>> +    bs->io_limits_enabled = false;
>> +    bs->req_from_queue    = false;
>> +
>> +    if (bs->block_queue) {
>> +        qemu_block_queue_flush(bs->block_queue);
>> +        qemu_del_block_queue(bs->block_queue);
>
> When you fix the acb lifecycle in block-queue.c this will no longer be
> safe.  Requests that are dispatched still have acbs that belong to
When the request is dispatched and successfully serviced, its
acb will be removed.
> this queue.  It is simplest to keep the queue for the lifetime of the
> BlockDriverState - it's just a list so it doesn't take up much memory.
I would prefer to remove this queue, even though keeping it around is
simple and harmless.

>
>> +    }
>> +
>> +    if (bs->block_timer) {
>> +        qemu_del_timer(bs->block_timer);
>> +        qemu_free_timer(bs->block_timer);
>
> To prevent double frees:
>
> bs->block_timer = NULL;
Doesn't qemu_free_timer() already ensure that it is NULL? This seems unnecessary. :)

>
>> +    }
>> +
>> +    bs->slice_start[0]   = 0;
>> +    bs->slice_start[1]   = 0;
>> +
>> +    bs->slice_end[0]     = 0;
>> +    bs->slice_end[1]     = 0;
>> +}
>> +
>> +static void bdrv_block_timer(void *opaque)
>> +{
>> +    BlockDriverState *bs = opaque;
>> +    BlockQueue *queue = bs->block_queue;
>> +
>> +    qemu_block_queue_flush(queue);
>> +}
>> +
>> +void bdrv_io_limits_enable(BlockDriverState *bs)
>> +{
>> +    bs->req_from_queue = false;
>> +
>> +    bs->block_queue    = qemu_new_block_queue();
>> +    bs->block_timer    = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
>> +
>> +    bs->slice_start[BLOCK_IO_LIMIT_READ]  = qemu_get_clock_ns(vm_clock);
>> +    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = qemu_get_clock_ns(vm_clock);
>> +
>> +    bs->slice_end[BLOCK_IO_LIMIT_READ]    =
>> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>> +    bs->slice_end[BLOCK_IO_LIMIT_WRITE]   =
>> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>
> The slice times could just be initialized to 0 here.  When
> bdrv_exceed_io_limits() is called it will start a new slice.
The result should be the same as with the current method even if they are initialized to zero.
>
>> +}
>> +
>> +bool bdrv_io_limits_enabled(BlockDriverState *bs)
>> +{
>> +    BlockIOLimit *io_limits = &bs->io_limits;
>> +    if ((io_limits->bps[BLOCK_IO_LIMIT_READ] == 0)
>> +         && (io_limits->bps[BLOCK_IO_LIMIT_WRITE] == 0)
>> +         && (io_limits->bps[BLOCK_IO_LIMIT_TOTAL] == 0)
>> +         && (io_limits->iops[BLOCK_IO_LIMIT_READ] == 0)
>> +         && (io_limits->iops[BLOCK_IO_LIMIT_WRITE] == 0)
>> +         && (io_limits->iops[BLOCK_IO_LIMIT_TOTAL] == 0)) {
>> +        return false;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>>  /* check if the path starts with "<protocol>:" */
>>  static int path_has_protocol(const char *path)
>>  {
>> @@ -642,6 +714,11 @@ int bdrv_open(BlockDriverState *bs, const char *filename, int flags,
>>             bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
>>     }
>>
>> +    /* throttling disk I/O limits */
>> +    if (bs->io_limits_enabled) {
>> +        bdrv_io_limits_enable(bs);
>> +    }
>> +
>>     return 0;
>>
>>  unlink_and_fail:
>> @@ -680,6 +757,16 @@ void bdrv_close(BlockDriverState *bs)
>>         if (bs->change_cb)
>>             bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
>>     }
>> +
>> +    /* throttling disk I/O limits */
>> +    if (bs->block_queue) {
>> +        qemu_del_block_queue(bs->block_queue);
>> +    }
>> +
>> +    if (bs->block_timer) {
>> +        qemu_del_timer(bs->block_timer);
>> +        qemu_free_timer(bs->block_timer);
>> +    }
>>  }
>>
>>  void bdrv_close_all(void)
>> @@ -1312,6 +1399,15 @@ void bdrv_get_geometry_hint(BlockDriverState *bs,
>>     *psecs = bs->secs;
>>  }
>>
>> +/* throttling disk io limits */
>> +void bdrv_set_io_limits(BlockDriverState *bs,
>> +                            BlockIOLimit *io_limits)
>> +{
>> +    memset(&bs->io_limits, 0, sizeof(BlockIOLimit));
>
> The assignment from *io_limits overwrites all of bs->io_limits, the
> memset() can be removed.
OK.
>
>> +    bs->io_limits = *io_limits;
>> +    bs->io_limits_enabled = bdrv_io_limits_enabled(bs);
>> +}
>> +
>>  /* Recognize floppy formats */
>>  typedef struct FDFormat {
>>     FDriveType drive;
>> @@ -1707,6 +1803,16 @@ static void bdrv_print_dict(QObject *obj, void *opaque)
>>                             qdict_get_bool(qdict, "ro"),
>>                             qdict_get_str(qdict, "drv"),
>>                             qdict_get_bool(qdict, "encrypted"));
>> +
>> +        monitor_printf(mon, " bps=%" PRId64 " bps_rd=%" PRId64
>> +                            " bps_wr=%" PRId64 " iops=%" PRId64
>> +                            " iops_rd=%" PRId64 " iops_wr=%" PRId64,
>> +                            qdict_get_int(qdict, "bps"),
>> +                            qdict_get_int(qdict, "bps_rd"),
>> +                            qdict_get_int(qdict, "bps_wr"),
>> +                            qdict_get_int(qdict, "iops"),
>> +                            qdict_get_int(qdict, "iops_rd"),
>> +                            qdict_get_int(qdict, "iops_wr"));
>>     } else {
>>         monitor_printf(mon, " [not inserted]");
>>     }
>> @@ -1739,10 +1845,22 @@ void bdrv_info(Monitor *mon, QObject **ret_data)
>>             QDict *bs_dict = qobject_to_qdict(bs_obj);
>>
>>             obj = qobject_from_jsonf("{ 'file': %s, 'ro': %i, 'drv': %s, "
>> -                                     "'encrypted': %i }",
>> +                                     "'encrypted': %i, "
>> +                                     "'bps': %" PRId64 ","
>> +                                     "'bps_rd': %" PRId64 ","
>> +                                     "'bps_wr': %" PRId64 ","
>> +                                     "'iops': %" PRId64 ","
>> +                                     "'iops_rd': %" PRId64 ","
>> +                                     "'iops_wr': %" PRId64 "}",
>>                                      bs->filename, bs->read_only,
>>                                      bs->drv->format_name,
>> -                                     bdrv_is_encrypted(bs));
>> +                                     bdrv_is_encrypted(bs),
>> +                                     bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL],
>> +                                     bs->io_limits.bps[BLOCK_IO_LIMIT_READ],
>> +                                     bs->io_limits.bps[BLOCK_IO_LIMIT_WRITE],
>> +                                     bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL],
>> +                                     bs->io_limits.iops[BLOCK_IO_LIMIT_READ],
>> +                                     bs->io_limits.iops[BLOCK_IO_LIMIT_WRITE]);
>>             if (bs->backing_file[0] != '\0') {
>>                 QDict *qdict = qobject_to_qdict(obj);
>>                 qdict_put(qdict, "backing_file",
>> @@ -2111,6 +2229,165 @@ char *bdrv_snapshot_dump(char *buf, int buf_size, QEMUSnapshotInfo *sn)
>>     return buf;
>>  }
>>
>> +static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
>> +                 bool is_write, double elapsed_time, uint64_t *wait) {
>> +    uint64_t bps_limit = 0;
>> +    double   bytes_limit, bytes_disp, bytes_res;
>> +    double   slice_time, wait_time;
>> +
>> +    if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
>> +        bps_limit = bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL];
>> +    } else if (bs->io_limits.bps[is_write]) {
>> +        bps_limit = bs->io_limits.bps[is_write];
>> +    } else {
>> +        if (wait) {
>> +            *wait = 0;
>> +        }
>> +
>> +        return false;
>> +    }
>> +
>> +    slice_time = bs->slice_end[is_write] - bs->slice_start[is_write];
>> +    slice_time /= (BLOCK_IO_SLICE_TIME * 10.0);
>> +    bytes_limit = bps_limit * slice_time;
>> +    bytes_disp  = bs->io_disps.bytes[is_write];
>> +    if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
>> +        bytes_disp += bs->io_disps.bytes[!is_write];
>> +    }
>> +
>> +    bytes_res   = (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
>> +
>> +    if (bytes_disp + bytes_res <= bytes_limit) {
>> +        if (wait) {
>> +            *wait = 0;
>> +        }
>> +
>> +        return false;
>> +    }
>> +
>> +    /* Calc approx time to dispatch */
>> +    wait_time = (bytes_disp + bytes_res) / bps_limit - elapsed_time;
>> +
>> +    if (wait) {
>> +        *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
>> +                             double elapsed_time, uint64_t *wait) {
>> +    uint64_t iops_limit = 0;
>> +    double   ios_limit, ios_disp;
>> +    double   slice_time, wait_time;
>> +
>> +    if (bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
>> +        iops_limit = bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL];
>> +    } else if (bs->io_limits.iops[is_write]) {
>> +        iops_limit = bs->io_limits.iops[is_write];
>> +    } else {
>> +        if (wait) {
>> +            *wait = 0;
>> +        }
>> +
>> +        return false;
>> +    }
>> +
>> +    slice_time = bs->slice_end[is_write] - bs->slice_start[is_write];
>> +    slice_time /= (BLOCK_IO_SLICE_TIME * 10.0);
>> +    ios_limit  = iops_limit * slice_time;
>> +    ios_disp   = bs->io_disps.ios[is_write];
>> +    if (bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
>> +        ios_disp += bs->io_disps.ios[!is_write];
>> +    }
>> +
>> +    if (ios_disp + 1 <= ios_limit) {
>> +        if (wait) {
>> +            *wait = 0;
>> +        }
>> +
>> +        return false;
>> +    }
>> +
>> +    /* Calc approx time to dispatch */
>> +    wait_time = (ios_disp + 1) / iops_limit;
>> +    if (wait_time > elapsed_time) {
>> +        wait_time = wait_time - elapsed_time;
>> +    } else {
>> +        wait_time = 0;
>> +    }
>> +
>> +    if (wait) {
>> +        *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
>> +                           bool is_write, uint64_t *wait) {
>> +    int64_t  real_time, real_slice;
>> +    uint64_t bps_wait = 0, iops_wait = 0, max_wait;
>> +    double   elapsed_time;
>> +    int      bps_ret, iops_ret;
>> +
>> +    real_time = qemu_get_clock_ns(vm_clock);
>
> Please call it 'now' instead of 'real_time'.  There is a "real-time
> clock" called rt_clock and this variable name could be confusing.
OK.
>
>> +    real_slice = bs->slice_end[is_write] - bs->slice_start[is_write];
>> +    if ((bs->slice_start[is_write] < real_time)
>> +        && (bs->slice_end[is_write] > real_time)) {
>> +        bs->slice_end[is_write]   = real_time + BLOCK_IO_SLICE_TIME;
>> +    } else {
>> +        bs->slice_start[is_write]     = real_time;
>> +        bs->slice_end[is_write]       = real_time + BLOCK_IO_SLICE_TIME;
>> +
>> +        bs->io_disps.bytes[is_write]  = 0;
>> +        bs->io_disps.bytes[!is_write] = 0;
>> +
>> +        bs->io_disps.ios[is_write]    = 0;
>> +        bs->io_disps.ios[!is_write]   = 0;
>> +    }
>> +
>> +    /* If a limit was exceeded, immediately queue this request */
>> +    if (!bs->req_from_queue
>> +        && !QTAILQ_EMPTY(&bs->block_queue->requests)) {
>
> bs->req_from_queue seems to be a way to prevent requests
> unconditionally being queued during flush.  Please define a BlockQueue
> interface instead, something like this:
>
> /* If a limit was exceeded, immediately queue this request */
> if (qemu_block_queue_has_pending(bs->block_queue)) {
>
> Then req_from_queue can be hidden inside BlockQueue.  I'd rename it to
> bool flushing.  qemu_block_queue_flush() will set it to true at the
> start of the function and false at the end of the function.
Good advice.
>
>> +        if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]
>> +            || bs->io_limits.bps[is_write] || bs->io_limits.iops[is_write]
>> +            || bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
>> +            if (wait) {
>> +                *wait = 0;
>
> No estimated wait time?  That's okay only if we stop setting the timer
> whenever a request is queued - more comments about this below.
Yeah.
>
>> +            }
>> +
>> +            return true;
>> +        }
>> +    }
>> +
>> +    elapsed_time  = real_time - bs->slice_start[is_write];
>> +    elapsed_time  /= (BLOCK_IO_SLICE_TIME * 10.0);
>
> Why * 10.0?
elapsed_time is currently in nanoseconds, and it needs to be converted into a
floating-point value in seconds (BLOCK_IO_SLICE_TIME * 10 = 10^9 ns = 1 s).
>
>> +
>> +    bps_ret  = bdrv_exceed_bps_limits(bs, nb_sectors,
>> +                                      is_write, elapsed_time, &bps_wait);
>> +    iops_ret = bdrv_exceed_iops_limits(bs, is_write,
>> +                                      elapsed_time, &iops_wait);
>> +    if (bps_ret || iops_ret) {
>> +        max_wait = bps_wait > iops_wait ? bps_wait : iops_wait;
>> +        if (wait) {
>> +            *wait = max_wait;
>> +        }
>> +
>> +        real_time = qemu_get_clock_ns(vm_clock);
>> +        if (bs->slice_end[is_write] < real_time + max_wait) {
>> +            bs->slice_end[is_write] = real_time + max_wait;
>> +        }
>
> Why is this necessary?  We have already extended the slice at the top
> of the function.
Here it adjusts the end of the slice based on the wait time computed by
bdrv_exceed_bps_limits()/bdrv_exceed_iops_limits(), which is more accurate.
>
>> +
>> +        return true;
>> +    }
>> +
>> +    if (wait) {
>> +        *wait = 0;
>> +    }
>> +
>> +    return false;
>> +}
>>
>>  /**************************************************************/
>>  /* async I/Os */
>> @@ -2121,13 +2398,28 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
>>  {
>>     BlockDriver *drv = bs->drv;
>>     BlockDriverAIOCB *ret;
>> +    uint64_t wait_time = 0;
>>
>>     trace_bdrv_aio_readv(bs, sector_num, nb_sectors, opaque);
>>
>> -    if (!drv)
>> -        return NULL;
>> -    if (bdrv_check_request(bs, sector_num, nb_sectors))
>> +    if (!drv || bdrv_check_request(bs, sector_num, nb_sectors)) {
>> +        if (bs->io_limits_enabled) {
>> +            bs->req_from_queue = false;
>> +        }
>>         return NULL;
>> +    }
>> +
>> +    /* throttling disk read I/O */
>> +    if (bs->io_limits_enabled) {
>> +        if (bdrv_exceed_io_limits(bs, nb_sectors, false, &wait_time)) {
>> +            ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_readv,
>> +                              sector_num, qiov, nb_sectors, cb, opaque);
>> +            qemu_mod_timer(bs->block_timer,
>> +                       wait_time + qemu_get_clock_ns(vm_clock));
>
> If a guest keeps sending I/O requests, say every millisecond, then we
> will delay dispatch forever.  In practice we hope the storage
> interface (virtio-blk, LSI SCSI, etc) has a resource limit that
> prevents this.  But the point remains that this delays requests that
> should be dispatched for too long.  Once the timer has been set we
> should not set it again.
Can you elaborate on this?

>
>> +            bs->req_from_queue = false;
>> +            return ret;
>> +        }
>> +    }
>>
>>     ret = drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors,
>>                               cb, opaque);
>> @@ -2136,6 +2428,16 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
>>        /* Update stats even though technically transfer has not happened. */
>>        bs->rd_bytes += (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
>>        bs->rd_ops ++;
>> +
>> +        if (bs->io_limits_enabled) {
>> +            bs->io_disps.bytes[BLOCK_IO_LIMIT_READ] +=
>> +                              (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
>> +            bs->io_disps.ios[BLOCK_IO_LIMIT_READ]++;
>> +        }
>> +    }
>> +
>> +    if (bs->io_limits_enabled) {
>> +        bs->req_from_queue = false;
>>     }
>>
>>     return ret;
>> @@ -2184,15 +2486,18 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
>>     BlockDriver *drv = bs->drv;
>>     BlockDriverAIOCB *ret;
>>     BlockCompleteData *blk_cb_data;
>> +    uint64_t wait_time = 0;
>>
>>     trace_bdrv_aio_writev(bs, sector_num, nb_sectors, opaque);
>>
>> -    if (!drv)
>> -        return NULL;
>> -    if (bs->read_only)
>> -        return NULL;
>> -    if (bdrv_check_request(bs, sector_num, nb_sectors))
>> +    if (!drv || bs->read_only
>> +        || bdrv_check_request(bs, sector_num, nb_sectors)) {
>> +        if (bs->io_limits_enabled) {
>> +            bs->req_from_queue = false;
>> +        }
>> +
>>         return NULL;
>> +    }
>>
>>     if (bs->dirty_bitmap) {
>>         blk_cb_data = blk_dirty_cb_alloc(bs, sector_num, nb_sectors, cb,
>> @@ -2201,6 +2506,18 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
>>         opaque = blk_cb_data;
>>     }
>>
>> +    /* throttling disk write I/O */
>> +    if (bs->io_limits_enabled) {
>> +        if (bdrv_exceed_io_limits(bs, nb_sectors, true, &wait_time)) {
>> +            ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_writev,
>> +                                  sector_num, qiov, nb_sectors, cb, opaque);
>> +            qemu_mod_timer(bs->block_timer,
>> +                                  wait_time + qemu_get_clock_ns(vm_clock));
>> +            bs->req_from_queue = false;
>> +            return ret;
>> +        }
>> +    }
>> +
>>     ret = drv->bdrv_aio_writev(bs, sector_num, qiov, nb_sectors,
>>                                cb, opaque);
>>
>> @@ -2211,6 +2528,16 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
>>         if (bs->wr_highest_sector < sector_num + nb_sectors - 1) {
>>             bs->wr_highest_sector = sector_num + nb_sectors - 1;
>>         }
>> +
>> +        if (bs->io_limits_enabled) {
>> +            bs->io_disps.bytes[BLOCK_IO_LIMIT_WRITE] +=
>> +                               (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
>> +            bs->io_disps.ios[BLOCK_IO_LIMIT_WRITE]++;
>> +        }
>> +    }
>> +
>> +    if (bs->io_limits_enabled) {
>> +        bs->req_from_queue = false;
>>     }
>>
>>     return ret;
>> diff --git a/block.h b/block.h
>> index 859d1d9..3d02902 100644
>> --- a/block.h
>> +++ b/block.h
>> @@ -57,6 +57,11 @@ void bdrv_info(Monitor *mon, QObject **ret_data);
>>  void bdrv_stats_print(Monitor *mon, const QObject *data);
>>  void bdrv_info_stats(Monitor *mon, QObject **ret_data);
>>
>> +/* disk I/O throttling */
>> +void bdrv_io_limits_enable(BlockDriverState *bs);
>> +void bdrv_io_limits_disable(BlockDriverState *bs);
>> +bool bdrv_io_limits_enabled(BlockDriverState *bs);
>> +
>>  void bdrv_init(void);
>>  void bdrv_init_with_whitelist(void);
>>  BlockDriver *bdrv_find_protocol(const char *filename);
>> @@ -97,7 +102,6 @@ int bdrv_change_backing_file(BlockDriverState *bs,
>>     const char *backing_file, const char *backing_fmt);
>>  void bdrv_register(BlockDriver *bdrv);
>>
>> -
>>  typedef struct BdrvCheckResult {
>>     int corruptions;
>>     int leaks;
>> diff --git a/block_int.h b/block_int.h
>> index 1e265d2..a4cd458 100644
>> --- a/block_int.h
>> +++ b/block_int.h
>> @@ -27,10 +27,17 @@
>>  #include "block.h"
>>  #include "qemu-option.h"
>>  #include "qemu-queue.h"
>> +#include "block/blk-queue.h"
>>
>>  #define BLOCK_FLAG_ENCRYPT     1
>>  #define BLOCK_FLAG_COMPAT6     4
>>
>> +#define BLOCK_IO_LIMIT_READ     0
>> +#define BLOCK_IO_LIMIT_WRITE    1
>> +#define BLOCK_IO_LIMIT_TOTAL    2
>> +
>> +#define BLOCK_IO_SLICE_TIME     100000000
>
> Please add a comment indicating the units: /* nanoseconds */
>
>> +
>>  #define BLOCK_OPT_SIZE          "size"
>>  #define BLOCK_OPT_ENCRYPT       "encryption"
>>  #define BLOCK_OPT_COMPAT6       "compat6"
>> @@ -46,6 +53,16 @@ typedef struct AIOPool {
>>     BlockDriverAIOCB *free_aiocb;
>>  } AIOPool;
>>
>> +typedef struct BlockIOLimit {
>> +    uint64_t bps[3];
>> +    uint64_t iops[3];
>> +} BlockIOLimit;
>> +
>> +typedef struct BlockIODisp {
>> +    uint64_t bytes[2];
>> +    uint64_t ios[2];
>> +} BlockIODisp;
>> +
>>  struct BlockDriver {
>>     const char *format_name;
>>     int instance_size;
>> @@ -175,6 +192,16 @@ struct BlockDriverState {
>>
>>     void *sync_aiocb;
>>
>> +    /* the time for latest disk I/O */
>> +    int64_t slice_start[2];
>> +    int64_t slice_end[2];
>> +    BlockIOLimit io_limits;
>> +    BlockIODisp  io_disps;
>> +    BlockQueue   *block_queue;
>> +    QEMUTimer    *block_timer;
>> +    bool         io_limits_enabled;
>> +    bool         req_from_queue;
>> +
>>     /* I/O stats (display with "info blockstats"). */
>>     uint64_t rd_bytes;
>>     uint64_t wr_bytes;
>> @@ -222,6 +249,9 @@ void qemu_aio_release(void *p);
>>
>>  void *qemu_blockalign(BlockDriverState *bs, size_t size);
>>
>> +void bdrv_set_io_limits(BlockDriverState *bs,
>> +                            BlockIOLimit *io_limits);
>> +
>>  #ifdef _WIN32
>>  int is_windows_drive(const char *filename);
>>  #endif
>> --
>> 1.7.2.3
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>



-- 
Regards,

Zhi Yong Wu


* Re: [Qemu-devel] [PATCH v5 1/4] block: add the command line support
  2011-08-10  5:20     ` Zhi Yong Wu
@ 2011-08-10  9:27       ` Stefan Hajnoczi
  2011-08-11  4:44         ` Zhi Yong Wu
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-08-10  9:27 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: kwolf, pair, kvm, Stefan Hajnoczi, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Wed, Aug 10, 2011 at 01:20:22PM +0800, Zhi Yong Wu wrote:
> On Tue, Aug 9, 2011 at 8:25 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
> >> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> >> ---
> >>  Makefile.objs   |    2 +-
> >>  blockdev.c      |   39 +++++++++++++++++++++++++++++++++++++++
> >>  qemu-config.c   |   24 ++++++++++++++++++++++++
> >>  qemu-option.c   |   17 +++++++++++++++++
> >>  qemu-option.h   |    1 +
> >>  qemu-options.hx |    1 +
> >>  6 files changed, 83 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/Makefile.objs b/Makefile.objs
> >> index 9f99ed4..06f2033 100644
> >> --- a/Makefile.objs
> >> +++ b/Makefile.objs
> >> @@ -23,7 +23,7 @@ block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vv
> >>  block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o
> >>  block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
> >>  block-nested-y += qed-check.o
> >> -block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
> >> +block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o blk-queue.o
> >
> > This does not build:
> >  LINK  qemu-ga
> > gcc: error: block/blk-queue.o: No such file or directory
> >
> > This Makefile.objs change should be in the commit that adds blk-queue.c.
> >
> > Each patch in a series should compile cleanly and can only depend on
> > previous patches.  This is important so that git-bisect(1) can be
> > used, it only works if every commit builds a working program.  It also
> > makes patch review easier when the patch series builds up logically.
> It seems that it will take a bit much time if we strictly stage the
> hunks into each corresponding patch.:)
> OK, i will.

Some people like using Stacked Git to manage patch series:
http://www.procode.org/stgit/

I typically just use git rebase -i and git add -i manually to clean up
patch series.

It also becomes easier once you plan to write patches that follow these
guidelines.

> >> +    /* disk io throttling */
> >> +    iol_flag = qemu_opt_io_limits_enable_flag(opts, iol_opts);
> >> +    if (iol_flag) {
> >> +        memset(&io_limits, 0, sizeof(BlockIOLimit));
> >> +
> >> +        io_limits.bps[BLOCK_IO_LIMIT_TOTAL]  =
> >> +                           qemu_opt_get_number(opts, "bps", 0);
> >> +        io_limits.bps[BLOCK_IO_LIMIT_READ]   =
> >> +                           qemu_opt_get_number(opts, "bps_rd", 0);
> >> +        io_limits.bps[BLOCK_IO_LIMIT_WRITE]  =
> >> +                           qemu_opt_get_number(opts, "bps_wr", 0);
> >> +        io_limits.iops[BLOCK_IO_LIMIT_TOTAL] =
> >> +                           qemu_opt_get_number(opts, "iops", 0);
> >> +        io_limits.iops[BLOCK_IO_LIMIT_READ]  =
> >> +                           qemu_opt_get_number(opts, "iops_rd", 0);
> >> +        io_limits.iops[BLOCK_IO_LIMIT_WRITE] =
> >> +                           qemu_opt_get_number(opts, "iops_wr", 0);
> >> +
> >> +        if (((io_limits.bps[BLOCK_IO_LIMIT_TOTAL] != 0)
> >> +            && ((io_limits.bps[BLOCK_IO_LIMIT_READ] != 0)
> >> +            || (io_limits.bps[BLOCK_IO_LIMIT_WRITE] != 0)))
> >> +            || ((io_limits.iops[BLOCK_IO_LIMIT_TOTAL] != 0)
> >> +            && ((io_limits.iops[BLOCK_IO_LIMIT_READ] != 0)
> >> +            || (io_limits.iops[BLOCK_IO_LIMIT_WRITE] != 0)))) {
> >> +            error_report("bps(iops) and bps_rd/bps_wr(iops_rd/iops_wr) \
> >> +                                        cannot be used at the same time");
> >> +            return NULL;
> >> +        }
> >> +    }
> >> +
> >>     on_write_error = BLOCK_ERR_STOP_ENOSPC;
> >>     if ((buf = qemu_opt_get(opts, "werror")) != NULL) {
> >>         if (type != IF_IDE && type != IF_SCSI && type != IF_VIRTIO && type != IF_NONE) {
> >> @@ -483,6 +517,11 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
> >>
> >>     bdrv_set_on_error(dinfo->bdrv, on_read_error, on_write_error);
> >>
> >> +    /* disk I/O throttling */
> >> +    if (iol_flag) {
> >> +        bdrv_set_io_limits(dinfo->bdrv, &io_limits);
> >> +    }
> >
> > iol_flag and qemu_opt_io_limits_enable_flag() are not necessary.  If
> > no limits were set then all fields will be 0 (unlimited).
> Are they not necessary here? why? qemu_opt_io_limits_enable_flag is
> used to determine if io_limits is enabled.
> If yes, iol_flag will be set to ONE. So i think that they are necessay here.

There are two possible cases: the user does not set any options or the
user sets at least one option.  In both cases io_limits will be
initialized correctly, here is why:

When an option is not specified by the user the value will be 0, which
means "unlimited".  bdrv_set_io_limits() calls
bdrv_io_limits_enabled(bs) to check whether any I/O limit is non-zero.
This means the iol_flag check is already being done by
bdrv_set_io_limits() and there is no need to duplicate it.  The iol_flag
code can be eliminated and the program will behave the same.

The code would look something like this:

io_limits.bps[BLOCK_IO_LIMIT_TOTAL]  =
                   qemu_opt_get_number(opts, "bps", 0);
io_limits.bps[BLOCK_IO_LIMIT_READ]   =
                   qemu_opt_get_number(opts, "bps_rd", 0);
io_limits.bps[BLOCK_IO_LIMIT_WRITE]  =
                   qemu_opt_get_number(opts, "bps_wr", 0);
io_limits.iops[BLOCK_IO_LIMIT_TOTAL] =
                   qemu_opt_get_number(opts, "iops", 0);
io_limits.iops[BLOCK_IO_LIMIT_READ]  =
                   qemu_opt_get_number(opts, "iops_rd", 0);
io_limits.iops[BLOCK_IO_LIMIT_WRITE] =
                   qemu_opt_get_number(opts, "iops_wr", 0);

if (((io_limits.bps[BLOCK_IO_LIMIT_TOTAL] != 0)
    && ((io_limits.bps[BLOCK_IO_LIMIT_READ] != 0)
    || (io_limits.bps[BLOCK_IO_LIMIT_WRITE] != 0)))
    || ((io_limits.iops[BLOCK_IO_LIMIT_TOTAL] != 0)
    && ((io_limits.iops[BLOCK_IO_LIMIT_READ] != 0)
    || (io_limits.iops[BLOCK_IO_LIMIT_WRITE] != 0)))) {
    error_report("bps(iops) and bps_rd/bps_wr(iops_rd/iops_wr) "
                 "cannot be used at the same time");
    return NULL;
}

bdrv_set_io_limits(dinfo->bdrv, &io_limits);

Stefan


* Re: [Qemu-devel] [PATCH v5 2/4] block: add the block queue support
  2011-08-10  5:54     ` Zhi Yong Wu
@ 2011-08-10  9:37       ` Stefan Hajnoczi
  2011-08-11  4:59         ` Zhi Yong Wu
                           ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-08-10  9:37 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: kwolf, pair, kvm, Stefan Hajnoczi, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Wed, Aug 10, 2011 at 01:54:33PM +0800, Zhi Yong Wu wrote:
> On Tue, Aug 9, 2011 at 8:49 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
> >> +BlockDriverAIOCB *qemu_block_queue_enqueue(BlockQueue *queue,
> >> +                        BlockDriverState *bs,
> >> +                        BlockRequestHandler *handler,
> >> +                        int64_t sector_num,
> >> +                        QEMUIOVector *qiov,
> >> +                        int nb_sectors,
> >> +                        BlockDriverCompletionFunc *cb,
> >> +                        void *opaque)
> >> +{
> >> +    BlockIORequest *request;
> >> +    BlockDriverAIOCB *acb;
> >> +
> >> +    request = qemu_malloc(sizeof(BlockIORequest));
> >> +    request->bs = bs;
> >> +    request->handler = handler;
> >> +    request->sector_num = sector_num;
> >> +    request->qiov = qiov;
> >> +    request->nb_sectors = nb_sectors;
> >> +    request->cb = cb;
> >> +    request->opaque = opaque;
> >> +
> >> +    QTAILQ_INSERT_TAIL(&queue->requests, request, entry);
> >
> > It would be simpler to define BlockQueueAIOCB and using it as our acb
> > instead of managing an extra BlockIORequest structure.  That way you
> > don't need to worry about extra mallocs and frees.
> Sorry, i don't get what you mean. how to define it? Can you elaborate?

BlockDriverAIOCB is designed to be embedded inside a bigger struct.  For
example, QEDAIOCB is a larger struct that contains BlockDriverAIOCB as
its first field:

typedef struct QEDAIOCB {
    BlockDriverAIOCB common;
    ...
} QEDAIOCB;

And the QED AIOPool contains the size of QEDAIOCB so that qemu_aio_get() can
allocate the full QEDAIOCB struct:

static AIOPool qed_aio_pool = {
    .aiocb_size         = sizeof(QEDAIOCB),
    .cancel             = qed_aio_cancel,
};

This allows QED to store per-request state in QEDAIOCB for the lifetime of a
request:

QEDAIOCB *acb = qemu_aio_get(&qed_aio_pool, bs, cb, opaque);

acb->is_write = is_write;
acb->finished = NULL;
acb->qiov = qiov;
...

I suggest creating a BlockQueueAIOCB that contains the fields from
BlockIORequest (which is no longer needed as a separate struct):

typedef struct BlockQueueAIOCB {
    BlockDriverAIOCB common;
    BlockRequestHandler *handler;
    int64_t sector_num;
    QEMUIOVector *qiov;
    int nb_sectors;
    QTAILQ_ENTRY(BlockQueueAIOCB) entry; /* pending request queue */
} BlockQueueAIOCB;

Now you can drop the malloc and simply qemu_aio_get() a new
BlockQueueAIOCB.

Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH v5 0/4] The intro of QEMU block I/O throttling
  2011-08-10  5:09   ` Zhi Yong Wu
@ 2011-08-10  9:39     ` Stefan Hajnoczi
  0 siblings, 0 replies; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-08-10  9:39 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: kwolf, pair, kvm, Stefan Hajnoczi, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Wed, Aug 10, 2011 at 01:09:27PM +0800, Zhi Yong Wu wrote:
> On Tue, Aug 9, 2011 at 8:08 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
> >>  Makefile.objs     |    2 +-
> >>  block.c           |  347 +++++++++++++++++++++++++++++++++++++++++++++++++++--
> >>  block.h           |    6 +-
> >>  block/blk-queue.c |  141 ++++++++++++++++++++++
> >>  block/blk-queue.h |   73 +++++++++++
> >>  block_int.h       |   30 +++++
> >>  blockdev.c        |  108 +++++++++++++++++
> >>  blockdev.h        |    2 +
> >>  hmp-commands.hx   |   15 +++
> >>  qemu-config.c     |   24 ++++
> >>  qemu-option.c     |   17 +++
> >>  qemu-option.h     |    1 +
> >>  qemu-options.hx   |    1 +
> >>  qerror.c          |    4 +
> >>  qerror.h          |    3 +
> >>  qmp-commands.hx   |   52 ++++++++-
> >>  16 files changed, 813 insertions(+), 13 deletions(-)
> >>  create mode 100644 block/blk-queue.c
> >>  create mode 100644 block/blk-queue.h
> >
> > This series is against qemu-kvm.git, please send block patches against
> > qemu.git/master in the future.  There is no qemu-kvm.git-specific code
> > here so just working against qemu.git and emailing qemu-devel is best.
> Do you know how to rebase it from qemu-kvm.git to qemu.git directly?

This is what I do:

# Export patches for all commits I have made since master
$ cd qemu-kvm
$ git format-patch master..

# Import patches into qemu.git
$ cd qemu && git checkout -b io_limits master
$ git am ~/qemu-kvm/*.patch

git-am(1) will ask you to resolve any conflicts.

Stefan


* Re: [Qemu-devel] [PATCH v5 3/4] block: add block timer and block throttling algorithm
  2011-08-10  6:57     ` Zhi Yong Wu
@ 2011-08-10 11:00       ` Stefan Hajnoczi
  2011-08-12  5:00         ` Zhi Yong Wu
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-08-10 11:00 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: kwolf, pair, stefanha, kvm, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Wed, Aug 10, 2011 at 7:57 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
> On Tue, Aug 9, 2011 at 11:19 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
>>> Note:
>>>      1.) When bps/iops limits are specified to a small value such as 511 bytes/s, this VM will hang up. We are considering how to handle this scenario.
>>
>> If an I/O request is larger than the limit itself then I think we
>> should let it through with a warning that the limit is too low.  This
> If it just prints a warning, that does not seem like a nice way; the
> throttling function will have no effect in this scenario.

Setting the limit below the max request size is a misconfiguration.
It's a problem that the user needs to be aware of and fix, so a
warning is appropriate.  Unfortunately the max request size is not
easy to determine from the block layer and may vary between guest
OSes, that's why we need to do something at runtime.

We cannot leave this case unhandled because it results in the guest
timing out I/O without any information about the problem - that makes
it hard to troubleshoot for the user.  Any other ideas on how to
handle this case?

>>> @@ -90,6 +100,68 @@ int is_windows_drive(const char *filename)
>>>  }
>>>  #endif
>>>
>>> +/* throttling disk I/O limits */
>>> +void bdrv_io_limits_disable(BlockDriverState *bs)
>>> +{
>>> +    bs->io_limits_enabled = false;
>>> +    bs->req_from_queue    = false;
>>> +
>>> +    if (bs->block_queue) {
>>> +        qemu_block_queue_flush(bs->block_queue);
>>> +        qemu_del_block_queue(bs->block_queue);
>>
>> When you fix the acb lifecycle in block-queue.c this will no longer be
>> safe.  Requests that are dispatched still have acbs that belong to
> When the request is dispatched, if it is successfully serviced, our
> acb will be removed.

Yes, that is how the patches work right now.  But the acb lifecycle
that I explained in response to the other patch changes this.

>> this queue.  It is simplest to keep the queue for the lifetime of the
>> BlockDriverState - it's just a list so it doesn't take up much memory.
> I would prefer to remove this queue, even though it is simple and
> harmless to let it exist.

If the existing acbs do not touch their BlockQueue after this point in
the code, then it will be safe to delete the queue since it will no
longer be referenced.  If you can guarantee this then it's fine to
keep this code.

>
>>
>>> +    }
>>> +
>>> +    if (bs->block_timer) {
>>> +        qemu_del_timer(bs->block_timer);
>>> +        qemu_free_timer(bs->block_timer);
>>
>> To prevent double frees:
>>
>> bs->block_timer = NULL;
> Can't qemu_free_timer() ensure that it is NULL? This seems unnecessary. :)

No, qemu_free_timer() is a function.  qemu_free_timer() cannot change
the pointer variable bs->block_timer because in C functions take
arguments by value.

After qemu_free_timer() bs->block_timer will still hold the old
address of the (freed) timer.  Then when bdrv_close() is called it
does qemu_free_timer() again because bs->block_timer is not NULL.  It
is illegal to free() a pointer twice and it can cause a crash.

>
>>
>>> +    }
>>> +
>>> +    bs->slice_start[0]   = 0;
>>> +    bs->slice_start[1]   = 0;
>>> +
>>> +    bs->slice_end[0]     = 0;
>>> +    bs->slice_end[1]     = 0;
>>> +}
>>> +
>>> +static void bdrv_block_timer(void *opaque)
>>> +{
>>> +    BlockDriverState *bs = opaque;
>>> +    BlockQueue *queue = bs->block_queue;
>>> +
>>> +    qemu_block_queue_flush(queue);
>>> +}
>>> +
>>> +void bdrv_io_limits_enable(BlockDriverState *bs)
>>> +{
>>> +    bs->req_from_queue = false;
>>> +
>>> +    bs->block_queue    = qemu_new_block_queue();
>>> +    bs->block_timer    = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
>>> +
>>> +    bs->slice_start[BLOCK_IO_LIMIT_READ]  = qemu_get_clock_ns(vm_clock);
>>> +    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = qemu_get_clock_ns(vm_clock);
>>> +
>>> +    bs->slice_end[BLOCK_IO_LIMIT_READ]    =
>>> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>>> +    bs->slice_end[BLOCK_IO_LIMIT_WRITE]   =
>>> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>>
>> The slice times could just be initialized to 0 here.  When
>> bdrv_exceed_io_limits() is called it will start a new slice.
> The result should be the same as with the current method even if they are set to zero.

Yes, the result is the same except we only create new slices in one place.

>>> +            }
>>> +
>>> +            return true;
>>> +        }
>>> +    }
>>> +
>>> +    elapsed_time  = real_time - bs->slice_start[is_write];
>>> +    elapsed_time  /= (BLOCK_IO_SLICE_TIME * 10.0);
>>
>> Why * 10.0?
> elapsed_time is currently in ns, and it needs to be translated into a
> floating-point value in minutes.

I think you mean seconds instead of minutes.  Then the conversion has
nothing to do with BLOCK_IO_SLICE_TIME, it's just nanoseconds to
seconds.  It would be clearer to drop BLOCK_IO_SLICE_TIME from the
expression and divide by a new NANOSECONDS_PER_SECOND constant
(1000000000.0).

>>
>>> +
>>> +    bps_ret  = bdrv_exceed_bps_limits(bs, nb_sectors,
>>> +                                      is_write, elapsed_time, &bps_wait);
>>> +    iops_ret = bdrv_exceed_iops_limits(bs, is_write,
>>> +                                      elapsed_time, &iops_wait);
>>> +    if (bps_ret || iops_ret) {
>>> +        max_wait = bps_wait > iops_wait ? bps_wait : iops_wait;
>>> +        if (wait) {
>>> +            *wait = max_wait;
>>> +        }
>>> +
>>> +        real_time = qemu_get_clock_ns(vm_clock);
>>> +        if (bs->slice_end[is_write] < real_time + max_wait) {
>>> +            bs->slice_end[is_write] = real_time + max_wait;
>>> +        }
>>
>> Why is this necessary?  We have already extended the slice at the top
>> of the function.
> Here it adjusts the end of the slice time based on
> bdrv_exceed_bps_limits() (and bdrv_exceed_iops_limits()).
> This is more accurate.

After rereading it I realized that this extends the slice up to the
estimated dispatch time (max_wait) and not just by
BLOCK_IO_SLICE_TIME.  This code now makes sense to me: we're trying to
keep slices around while there are queued requests so that
iops/throughput history is not forgotten when queued requests are
dispatched.

>>
>>> +
>>> +        return true;
>>> +    }
>>> +
>>> +    if (wait) {
>>> +        *wait = 0;
>>> +    }
>>> +
>>> +    return false;
>>> +}
>>>
>>>  /**************************************************************/
>>>  /* async I/Os */
>>> @@ -2121,13 +2398,28 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
>>>  {
>>>     BlockDriver *drv = bs->drv;
>>>     BlockDriverAIOCB *ret;
>>> +    uint64_t wait_time = 0;
>>>
>>>     trace_bdrv_aio_readv(bs, sector_num, nb_sectors, opaque);
>>>
>>> -    if (!drv)
>>> -        return NULL;
>>> -    if (bdrv_check_request(bs, sector_num, nb_sectors))
>>> +    if (!drv || bdrv_check_request(bs, sector_num, nb_sectors)) {
>>> +        if (bs->io_limits_enabled) {
>>> +            bs->req_from_queue = false;
>>> +        }
>>>         return NULL;
>>> +    }
>>> +
>>> +    /* throttling disk read I/O */
>>> +    if (bs->io_limits_enabled) {
>>> +        if (bdrv_exceed_io_limits(bs, nb_sectors, false, &wait_time)) {
>>> +            ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_readv,
>>> +                              sector_num, qiov, nb_sectors, cb, opaque);
>>> +            qemu_mod_timer(bs->block_timer,
>>> +                       wait_time + qemu_get_clock_ns(vm_clock));
>>
>> If a guest keeps sending I/O requests, say every millisecond, then we
>> will delay dispatch forever.  In practice we hope the storage
>> interface (virtio-blk, LSI SCSI, etc) has a resource limit that
>> prevents this.  But the point remains that this delays requests that
>> should be dispatched for too long.  Once the timer has been set we
>> should not set it again.
> Can you elaborate on this?

Let's wait until the next version of the patch, I've suggested many
changes and should comment against current code instead of the way I
think it is going to work :).

Stefan


* Re: [Qemu-devel] [PATCH v5 1/4] block: add the command line support
  2011-08-10  9:27       ` Stefan Hajnoczi
@ 2011-08-11  4:44         ` Zhi Yong Wu
  2011-08-11  5:35           ` Stefan Hajnoczi
  0 siblings, 1 reply; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-11  4:44 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kwolf, pair, kvm, Stefan Hajnoczi, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Wed, Aug 10, 2011 at 5:27 PM, Stefan Hajnoczi
<stefanha@linux.vnet.ibm.com> wrote:
> On Wed, Aug 10, 2011 at 01:20:22PM +0800, Zhi Yong Wu wrote:
>> On Tue, Aug 9, 2011 at 8:25 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> > On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
>> >> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>> >> ---
>> >>  Makefile.objs   |    2 +-
>> >>  blockdev.c      |   39 +++++++++++++++++++++++++++++++++++++++
>> >>  qemu-config.c   |   24 ++++++++++++++++++++++++
>> >>  qemu-option.c   |   17 +++++++++++++++++
>> >>  qemu-option.h   |    1 +
>> >>  qemu-options.hx |    1 +
>> >>  6 files changed, 83 insertions(+), 1 deletions(-)
>> >>
>> >> diff --git a/Makefile.objs b/Makefile.objs
>> >> index 9f99ed4..06f2033 100644
>> >> --- a/Makefile.objs
>> >> +++ b/Makefile.objs
>> >> @@ -23,7 +23,7 @@ block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vv
>> >>  block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o
>> >>  block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
>> >>  block-nested-y += qed-check.o
>> >> -block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
>> >> +block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o blk-queue.o
>> >
>> > This does not build:
>> >  LINK  qemu-ga
>> > gcc: error: block/blk-queue.o: No such file or directory
>> >
>> > This Makefile.objs change should be in the commit that adds blk-queue.c.
>> >
>> > Each patch in a series should compile cleanly and can only depend on
>> > previous patches.  This is important so that git-bisect(1) can be
>> > used, it only works if every commit builds a working program.  It also
>> > makes patch review easier when the patch series builds up logically.
>> It seems that it will take quite a bit of time if we strictly stage the
>> hunks into each corresponding patch. :)
>> OK, I will.
>
> Some people like using Stacked Git to manage patch series:
> http://www.procode.org/stgit/
Let me try.
>
> I typically just use git rebase -i and git add -i manually to clean up
> patch series.
>
> It also becomes easier once you plan to write patches that follow these
> guidelines.
OK
>
>> >> +    /* disk io throttling */
>> >> +    iol_flag = qemu_opt_io_limits_enable_flag(opts, iol_opts);
>> >> +    if (iol_flag) {
>> >> +        memset(&io_limits, 0, sizeof(BlockIOLimit));
>> >> +
>> >> +        io_limits.bps[BLOCK_IO_LIMIT_TOTAL]  =
>> >> +                           qemu_opt_get_number(opts, "bps", 0);
>> >> +        io_limits.bps[BLOCK_IO_LIMIT_READ]   =
>> >> +                           qemu_opt_get_number(opts, "bps_rd", 0);
>> >> +        io_limits.bps[BLOCK_IO_LIMIT_WRITE]  =
>> >> +                           qemu_opt_get_number(opts, "bps_wr", 0);
>> >> +        io_limits.iops[BLOCK_IO_LIMIT_TOTAL] =
>> >> +                           qemu_opt_get_number(opts, "iops", 0);
>> >> +        io_limits.iops[BLOCK_IO_LIMIT_READ]  =
>> >> +                           qemu_opt_get_number(opts, "iops_rd", 0);
>> >> +        io_limits.iops[BLOCK_IO_LIMIT_WRITE] =
>> >> +                           qemu_opt_get_number(opts, "iops_wr", 0);
>> >> +
>> >> +        if (((io_limits.bps[BLOCK_IO_LIMIT_TOTAL] != 0)
>> >> +            && ((io_limits.bps[BLOCK_IO_LIMIT_READ] != 0)
>> >> +            || (io_limits.bps[BLOCK_IO_LIMIT_WRITE] != 0)))
>> >> +            || ((io_limits.iops[BLOCK_IO_LIMIT_TOTAL] != 0)
>> >> +            && ((io_limits.iops[BLOCK_IO_LIMIT_READ] != 0)
>> >> +            || (io_limits.iops[BLOCK_IO_LIMIT_WRITE] != 0)))) {
>> >> +            error_report("bps(iops) and bps_rd/bps_wr(iops_rd/iops_wr) \
>> >> +                                        cannot be used at the same time");
>> >> +            return NULL;
>> >> +        }
>> >> +    }
>> >> +
>> >>     on_write_error = BLOCK_ERR_STOP_ENOSPC;
>> >>     if ((buf = qemu_opt_get(opts, "werror")) != NULL) {
>> >>         if (type != IF_IDE && type != IF_SCSI && type != IF_VIRTIO && type != IF_NONE) {
>> >> @@ -483,6 +517,11 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
>> >>
>> >>     bdrv_set_on_error(dinfo->bdrv, on_read_error, on_write_error);
>> >>
>> >> +    /* disk I/O throttling */
>> >> +    if (iol_flag) {
>> >> +        bdrv_set_io_limits(dinfo->bdrv, &io_limits);
>> >> +    }
>> >
>> > iol_flag and qemu_opt_io_limits_enable_flag() are not necessary.  If
>> > no limits were set then all fields will be 0 (unlimited).
>> Are they not necessary here? Why? qemu_opt_io_limits_enable_flag() is
>> used to determine whether io_limits is enabled.
>> If yes, iol_flag will be set to one, so I think they are necessary here.
>
> There are two possible cases: the user does not set any options or the
> user sets at least one option.  In both cases io_limits will be
> initialized correctly, here is why:
>
> When an option is not specified by the user the value will be 0, which
> means "unlimited".  bdrv_set_io_limits() calls
> bdrv_io_limits_enabled(bs) to check whether any I/O limit is non-zero.
> This means the iol_flag check is already being done by
> bdrv_set_io_limits() and there is no need to duplicate it.  The iol_flag
> code can be eliminated and the program will behave the same.
>
> The code would look something like this:
If the user doesn't set any io_limits options, this chunk of code is
still executed. Anyway, that's OK. Thanks.
>
> io_limits.bps[BLOCK_IO_LIMIT_TOTAL]  =
>                    qemu_opt_get_number(opts, "bps", 0);
> io_limits.bps[BLOCK_IO_LIMIT_READ]   =
>                    qemu_opt_get_number(opts, "bps_rd", 0);
> io_limits.bps[BLOCK_IO_LIMIT_WRITE]  =
>                    qemu_opt_get_number(opts, "bps_wr", 0);
> io_limits.iops[BLOCK_IO_LIMIT_TOTAL] =
>                    qemu_opt_get_number(opts, "iops", 0);
> io_limits.iops[BLOCK_IO_LIMIT_READ]  =
>                    qemu_opt_get_number(opts, "iops_rd", 0);
> io_limits.iops[BLOCK_IO_LIMIT_WRITE] =
>                   qemu_opt_get_number(opts, "iops_wr", 0);
>
> if (((io_limits.bps[BLOCK_IO_LIMIT_TOTAL] != 0)
>     && ((io_limits.bps[BLOCK_IO_LIMIT_READ] != 0)
>     || (io_limits.bps[BLOCK_IO_LIMIT_WRITE] != 0)))
>     || ((io_limits.iops[BLOCK_IO_LIMIT_TOTAL] != 0)
>     && ((io_limits.iops[BLOCK_IO_LIMIT_READ] != 0)
>     || (io_limits.iops[BLOCK_IO_LIMIT_WRITE] != 0)))) {
>     error_report("bps(iops) and bps_rd/bps_wr(iops_rd/iops_wr) "
>                 "cannot be used at the same time");
>    return NULL;
> }
>
> bdrv_set_io_limits(dinfo->bdrv, &io_limits);
>
> Stefan
>



-- 
Regards,

Zhi Yong Wu


* Re: [Qemu-devel] [PATCH v5 2/4] block: add the block queue support
  2011-08-10  9:37       ` Stefan Hajnoczi
@ 2011-08-11  4:59         ` Zhi Yong Wu
  2011-08-11  5:36         ` Zhi Yong Wu
  2011-08-12  4:40         ` Zhi Yong Wu
  2 siblings, 0 replies; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-11  4:59 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kwolf, pair, kvm, Stefan Hajnoczi, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Wed, Aug 10, 2011 at 5:37 PM, Stefan Hajnoczi
<stefanha@linux.vnet.ibm.com> wrote:
> On Wed, Aug 10, 2011 at 01:54:33PM +0800, Zhi Yong Wu wrote:
>> On Tue, Aug 9, 2011 at 8:49 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> > On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
>> >> +BlockDriverAIOCB *qemu_block_queue_enqueue(BlockQueue *queue,
>> >> +                        BlockDriverState *bs,
>> >> +                        BlockRequestHandler *handler,
>> >> +                        int64_t sector_num,
>> >> +                        QEMUIOVector *qiov,
>> >> +                        int nb_sectors,
>> >> +                        BlockDriverCompletionFunc *cb,
>> >> +                        void *opaque)
>> >> +{
>> >> +    BlockIORequest *request;
>> >> +    BlockDriverAIOCB *acb;
>> >> +
>> >> +    request = qemu_malloc(sizeof(BlockIORequest));
>> >> +    request->bs = bs;
>> >> +    request->handler = handler;
>> >> +    request->sector_num = sector_num;
>> >> +    request->qiov = qiov;
>> >> +    request->nb_sectors = nb_sectors;
>> >> +    request->cb = cb;
>> >> +    request->opaque = opaque;
>> >> +
>> >> +    QTAILQ_INSERT_TAIL(&queue->requests, request, entry);
>> >
>> > It would be simpler to define a BlockQueueAIOCB and use it as our acb
>> > instead of managing an extra BlockIORequest structure.  That way you
>> > don't need to worry about extra mallocs and frees.
>> Sorry, I don't get what you mean. How would one define it? Can you elaborate?
>
> BlockDriverAIOCB is designed to be embedded inside a bigger struct.  For
> example, QEDAIOCB is a larger struct that contains BlockDriverAIOCB as
> its first field:
>
> typedef struct QEDAIOCB {
>    BlockDriverAIOCB common;
>    ...
> } QEDAIOCB;
>
> And the QED AIOPool contains the size of QEDAIOCB so that qemu_aio_get() can
> allocate the full QEDAIOCB struct:
>
> static AIOPool qed_aio_pool = {
>    .aiocb_size         = sizeof(QEDAIOCB),
>    .cancel             = qed_aio_cancel,
> };
>
> This allows QED to store per-request state in QEDAIOCB for the lifetime of a
> request:
>
> QEDAIOCB *acb = qemu_aio_get(&qed_aio_pool, bs, cb, opaque);
>
> acb->is_write = is_write;
> acb->finished = NULL;
> acb->qiov = qiov;
> ...
>
> I suggest creating a BlockQueueAIOCB that contains the fields from
> BlockIORequest (which is no longer needed as a separate struct):
>
> typedef struct BlockQueueAIOCB {
>    BlockDriverAIOCB common;
>    BlockRequestHandler *handler;
>    int64_t sector_num;
>    QEMUIOVector *qiov;
>    int nb_sectors;
>    QTAILQ_ENTRY(BlockQueueAIOCB) entry; /* pending request queue */
> } BlockQueueAIOCB;
>
> Now you can drop the malloc and simply qemu_aio_get() a new
> BlockQueueAIOCB.
Yesterday, I had actually already done it this way. :) Thanks.
>
> Stefan
>



-- 
Regards,

Zhi Yong Wu


* Re: [Qemu-devel] [PATCH v5 1/4] block: add the command line support
  2011-08-11  4:44         ` Zhi Yong Wu
@ 2011-08-11  5:35           ` Stefan Hajnoczi
  0 siblings, 0 replies; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-08-11  5:35 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: kwolf, pair, kvm, Stefan Hajnoczi, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Thu, Aug 11, 2011 at 12:44:11PM +0800, Zhi Yong Wu wrote:
> On Wed, Aug 10, 2011 at 5:27 PM, Stefan Hajnoczi
> <stefanha@linux.vnet.ibm.com> wrote:
> > On Wed, Aug 10, 2011 at 01:20:22PM +0800, Zhi Yong Wu wrote:
> >> On Tue, Aug 9, 2011 at 8:25 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >> > On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
> >> >> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
> >> >> ---
> >> >>  Makefile.objs   |    2 +-
> >> >>  blockdev.c      |   39 +++++++++++++++++++++++++++++++++++++++
> >> >>  qemu-config.c   |   24 ++++++++++++++++++++++++
> >> >>  qemu-option.c   |   17 +++++++++++++++++
> >> >>  qemu-option.h   |    1 +
> >> >>  qemu-options.hx |    1 +
> >> >>  6 files changed, 83 insertions(+), 1 deletions(-)
> >> >>
> >> >> diff --git a/Makefile.objs b/Makefile.objs
> >> >> index 9f99ed4..06f2033 100644
> >> >> --- a/Makefile.objs
> >> >> +++ b/Makefile.objs
> >> >> @@ -23,7 +23,7 @@ block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vv
> >> >>  block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o
> >> >>  block-nested-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
> >> >>  block-nested-y += qed-check.o
> >> >> -block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o
> >> >> +block-nested-y += parallels.o nbd.o blkdebug.o sheepdog.o blkverify.o blk-queue.o
> >> >
> >> > This does not build:
> >> >  LINK  qemu-ga
> >> > gcc: error: block/blk-queue.o: No such file or directory
> >> >
> >> > This Makefile.objs change should be in the commit that adds blk-queue.c.
> >> >
> >> > Each patch in a series should compile cleanly and can only depend on
> >> > previous patches.  This is important so that git-bisect(1) can be
> >> > used, it only works if every commit builds a working program.  It also
> >> > makes patch review easier when the patch series builds up logically.
> >> It seems that it will take quite a bit of time if we strictly stage the
> >> hunks into each corresponding patch. :)
> >> OK, I will.
> >
> > Some people like using Stacked Git to manage patch series:
> > http://www.procode.org/stgit/
> Let me try.
> >
> > I typically just use git rebase -i and git add -i manually to clean up
> > patch series.
> >
> > It also becomes easier once you plan to write patches that follow these
> > guidelines.
> OK
> >
> >> >> +    /* disk io throttling */
> >> >> +    iol_flag = qemu_opt_io_limits_enable_flag(opts, iol_opts);
> >> >> +    if (iol_flag) {
> >> >> +        memset(&io_limits, 0, sizeof(BlockIOLimit));
> >> >> +
> >> >> +        io_limits.bps[BLOCK_IO_LIMIT_TOTAL]  =
> >> >> +                           qemu_opt_get_number(opts, "bps", 0);
> >> >> +        io_limits.bps[BLOCK_IO_LIMIT_READ]   =
> >> >> +                           qemu_opt_get_number(opts, "bps_rd", 0);
> >> >> +        io_limits.bps[BLOCK_IO_LIMIT_WRITE]  =
> >> >> +                           qemu_opt_get_number(opts, "bps_wr", 0);
> >> >> +        io_limits.iops[BLOCK_IO_LIMIT_TOTAL] =
> >> >> +                           qemu_opt_get_number(opts, "iops", 0);
> >> >> +        io_limits.iops[BLOCK_IO_LIMIT_READ]  =
> >> >> +                           qemu_opt_get_number(opts, "iops_rd", 0);
> >> >> +        io_limits.iops[BLOCK_IO_LIMIT_WRITE] =
> >> >> +                           qemu_opt_get_number(opts, "iops_wr", 0);
> >> >> +
> >> >> +        if (((io_limits.bps[BLOCK_IO_LIMIT_TOTAL] != 0)
> >> >> +            && ((io_limits.bps[BLOCK_IO_LIMIT_READ] != 0)
> >> >> +            || (io_limits.bps[BLOCK_IO_LIMIT_WRITE] != 0)))
> >> >> +            || ((io_limits.iops[BLOCK_IO_LIMIT_TOTAL] != 0)
> >> >> +            && ((io_limits.iops[BLOCK_IO_LIMIT_READ] != 0)
> >> >> +            || (io_limits.iops[BLOCK_IO_LIMIT_WRITE] != 0)))) {
> >> >> +            error_report("bps(iops) and bps_rd/bps_wr(iops_rd/iops_wr) \
> >> >> +                                        cannot be used at the same time");
> >> >> +            return NULL;
> >> >> +        }
> >> >> +    }
> >> >> +
> >> >>     on_write_error = BLOCK_ERR_STOP_ENOSPC;
> >> >>     if ((buf = qemu_opt_get(opts, "werror")) != NULL) {
> >> >>         if (type != IF_IDE && type != IF_SCSI && type != IF_VIRTIO && type != IF_NONE) {
> >> >> @@ -483,6 +517,11 @@ DriveInfo *drive_init(QemuOpts *opts, int default_to_scsi)
> >> >>
> >> >>     bdrv_set_on_error(dinfo->bdrv, on_read_error, on_write_error);
> >> >>
> >> >> +    /* disk I/O throttling */
> >> >> +    if (iol_flag) {
> >> >> +        bdrv_set_io_limits(dinfo->bdrv, &io_limits);
> >> >> +    }
> >> >
> >> > iol_flag and qemu_opt_io_limits_enable_flag() are not necessary.  If
> >> > no limits were set then all fields will be 0 (unlimited).
> >> Are they not necessary here? Why? qemu_opt_io_limits_enable_flag() is
> >> used to determine whether io_limits is enabled.
> >> If yes, iol_flag will be set to one, so I think they are necessary here.
> >
> > There are two possible cases: the user does not set any options or the
> > user sets at least one option.  In both cases io_limits will be
> > initialized correctly, here is why:
> >
> > When an option is not specified by the user the value will be 0, which
> > means "unlimited".  bdrv_set_io_limits() calls
> > bdrv_io_limits_enabled(bs) to check whether any I/O limit is non-zero.
> > This means the iol_flag check is already being done by
> > bdrv_set_io_limits() and there is no need to duplicate it.  The iol_flag
> > code can be eliminated and the program will behave the same.
> >
> > The code would look something like this:
> If the user doesn't set any io_limits options, this chunk of code is
> still executed. Anyway, that's OK. Thanks.

Yes, that's why qemu_opt_get_number() takes a default value.

Stefan


* Re: [Qemu-devel] [PATCH v5 2/4] block: add the block queue support
  2011-08-10  9:37       ` Stefan Hajnoczi
  2011-08-11  4:59         ` Zhi Yong Wu
@ 2011-08-11  5:36         ` Zhi Yong Wu
  2011-08-12  4:40         ` Zhi Yong Wu
  2 siblings, 0 replies; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-11  5:36 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kwolf, pair, kvm, Stefan Hajnoczi, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Wed, Aug 10, 2011 at 5:37 PM, Stefan Hajnoczi
<stefanha@linux.vnet.ibm.com> wrote:
> On Wed, Aug 10, 2011 at 01:54:33PM +0800, Zhi Yong Wu wrote:
>> On Tue, Aug 9, 2011 at 8:49 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> > On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
>> >> +BlockDriverAIOCB *qemu_block_queue_enqueue(BlockQueue *queue,
>> >> +                        BlockDriverState *bs,
>> >> +                        BlockRequestHandler *handler,
>> >> +                        int64_t sector_num,
>> >> +                        QEMUIOVector *qiov,
>> >> +                        int nb_sectors,
>> >> +                        BlockDriverCompletionFunc *cb,
>> >> +                        void *opaque)
>> >> +{
>> >> +    BlockIORequest *request;
>> >> +    BlockDriverAIOCB *acb;
>> >> +
>> >> +    request = qemu_malloc(sizeof(BlockIORequest));
>> >> +    request->bs = bs;
>> >> +    request->handler = handler;
>> >> +    request->sector_num = sector_num;
>> >> +    request->qiov = qiov;
>> >> +    request->nb_sectors = nb_sectors;
>> >> +    request->cb = cb;
>> >> +    request->opaque = opaque;
>> >> +
>> >> +    QTAILQ_INSERT_TAIL(&queue->requests, request, entry);
>> >
>> > It would be simpler to define a BlockQueueAIOCB and use it as our acb
>> > instead of managing an extra BlockIORequest structure.  That way you
>> > don't need to worry about extra mallocs and frees.
>> Sorry, I don't get what you mean. How would one define it? Can you elaborate?
>
> BlockDriverAIOCB is designed to be embedded inside a bigger struct.  For
> example, QEDAIOCB is a larger struct that contains BlockDriverAIOCB as
> its first field:
>
> typedef struct QEDAIOCB {
>    BlockDriverAIOCB common;
>    ...
> } QEDAIOCB;
>
> And the QED AIOPool contains the size of QEDAIOCB so that qemu_aio_get() can
> allocate the full QEDAIOCB struct:
>
> static AIOPool qed_aio_pool = {
>    .aiocb_size         = sizeof(QEDAIOCB),
>    .cancel             = qed_aio_cancel,
> };
>
> This allows QED to store per-request state in QEDAIOCB for the lifetime of a
> request:
>
> QEDAIOCB *acb = qemu_aio_get(&qed_aio_pool, bs, cb, opaque);
>
> acb->is_write = is_write;
> acb->finished = NULL;
> acb->qiov = qiov;
> ...
>
> I suggest creating a BlockQueueAIOCB that contains the fields from
> BlockIORequest (which is no longer needed as a separate struct):
>
> typedef struct BlockQueueAIOCB {
>    BlockDriverAIOCB common;
>    BlockRequestHandler *handler;
>    int64_t sector_num;
>    QEMUIOVector *qiov;
>    int nb_sectors;
>    QTAILQ_ENTRY(BlockQueueAIOCB) entry; /* pending request queue */
> } BlockQueueAIOCB;
Its name is a bit confusing; an ACB is inserted into the block queue.
>
> Now you can drop the malloc and simply qemu_aio_get() a new
> BlockQueueAIOCB.
>
> Stefan
>



-- 
Regards,

Zhi Yong Wu


* Re: [Qemu-devel] [PATCH v5 2/4] block: add the block queue support
  2011-08-10  9:37       ` Stefan Hajnoczi
  2011-08-11  4:59         ` Zhi Yong Wu
  2011-08-11  5:36         ` Zhi Yong Wu
@ 2011-08-12  4:40         ` Zhi Yong Wu
  2011-08-12  4:50           ` Stefan Hajnoczi
  2 siblings, 1 reply; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-12  4:40 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kwolf, pair, kvm, Stefan Hajnoczi, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Wed, Aug 10, 2011 at 5:37 PM, Stefan Hajnoczi
<stefanha@linux.vnet.ibm.com> wrote:
> On Wed, Aug 10, 2011 at 01:54:33PM +0800, Zhi Yong Wu wrote:
>> On Tue, Aug 9, 2011 at 8:49 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> > On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
>> >> +BlockDriverAIOCB *qemu_block_queue_enqueue(BlockQueue *queue,
>> >> +                        BlockDriverState *bs,
>> >> +                        BlockRequestHandler *handler,
>> >> +                        int64_t sector_num,
>> >> +                        QEMUIOVector *qiov,
>> >> +                        int nb_sectors,
>> >> +                        BlockDriverCompletionFunc *cb,
>> >> +                        void *opaque)
>> >> +{
>> >> +    BlockIORequest *request;
>> >> +    BlockDriverAIOCB *acb;
>> >> +
>> >> +    request = qemu_malloc(sizeof(BlockIORequest));
>> >> +    request->bs = bs;
>> >> +    request->handler = handler;
>> >> +    request->sector_num = sector_num;
>> >> +    request->qiov = qiov;
>> >> +    request->nb_sectors = nb_sectors;
>> >> +    request->cb = cb;
>> >> +    request->opaque = opaque;
>> >> +
>> >> +    QTAILQ_INSERT_TAIL(&queue->requests, request, entry);
>> >
>> > It would be simpler to define BlockQueueAIOCB and using it as our acb
>> > instead of managing an extra BlockIORequest structure.  That way you
>> > don't need to worry about extra mallocs and frees.
>> Sorry, I don't get what you mean. How would it be defined? Can you elaborate?
>
> BlockDriverAIOCB is designed to be embedded inside a bigger struct.  For
> example, QEDAIOCB is a larger struct that contains BlockDriverAIOCB as
> its first field:
>
> typedef struct QEDAIOCB {
>    BlockDriverAIOCB common;
>    ...
> } QEDAIOCB;
>
> And the QED AIOPool contains the size of QEDAIOCB so that qemu_aio_get() can
> allocate the full QEDAIOCB struct:
>
> static AIOPool qed_aio_pool = {
>    .aiocb_size         = sizeof(QEDAIOCB),
>    .cancel             = qed_aio_cancel,
> };
>
> This allows QED to store per-request state in QEDAIOCB for the lifetime of a
> request:
>
> QEDAIOCB *acb = qemu_aio_get(&qed_aio_pool, bs, cb, opaque);
>
> acb->is_write = is_write;
> acb->finished = NULL;
> acb->qiov = qiov;
> ...
>
> I suggest creating a BlockQueueAIOCB that contains the fields from
> BlockIORequest (which is no longer needed as a separate struct):
>
> typedef struct BlockQueueAIOCB {
>    BlockDriverAIOCB common;
>    BlockRequestHandler *handler;
>    int64_t sector_num;
>    QEMUIOVector *qiov;
>    int nb_sectors;
>    QTAILQ_ENTRY(BlockQueueAIOCB) entry; /* pending request queue */
> } BlockQueueAIOCB;
>
> Now you can drop the malloc and simply qemu_aio_get() a new
> BlockQueueAIOCB.
Yesterday I tried this. To be honest, I do not like this solution. :)
It makes the code logic a bit messy and unclear: an AIOCB struct is
treated as a block request and enqueued into the block queue, which is
a bit weird. So I would like to retain BlockIORequest and define a
BlockQueueAIOCB pool. This struct is used to store some useful
info:
struct BlockQueueAIOCB {
    BlockDriverAIOCB common;
    bool dispatched;
    BlockDriverAIOCB *real_acb;
    BlockIORequest *request;
};

What do you think of this?

>
> Stefan
>

-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH v5 2/4] block: add the block queue support
  2011-08-12  4:40         ` Zhi Yong Wu
@ 2011-08-12  4:50           ` Stefan Hajnoczi
  0 siblings, 0 replies; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-08-12  4:50 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: kwolf, pair, Stefan Hajnoczi, kvm, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Fri, Aug 12, 2011 at 5:40 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
> On Wed, Aug 10, 2011 at 5:37 PM, Stefan Hajnoczi
> <stefanha@linux.vnet.ibm.com> wrote:
>> On Wed, Aug 10, 2011 at 01:54:33PM +0800, Zhi Yong Wu wrote:
>>> On Tue, Aug 9, 2011 at 8:49 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>> > On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
>>> >> +BlockDriverAIOCB *qemu_block_queue_enqueue(BlockQueue *queue,
>>> >> +                        BlockDriverState *bs,
>>> >> +                        BlockRequestHandler *handler,
>>> >> +                        int64_t sector_num,
>>> >> +                        QEMUIOVector *qiov,
>>> >> +                        int nb_sectors,
>>> >> +                        BlockDriverCompletionFunc *cb,
>>> >> +                        void *opaque)
>>> >> +{
>>> >> +    BlockIORequest *request;
>>> >> +    BlockDriverAIOCB *acb;
>>> >> +
>>> >> +    request = qemu_malloc(sizeof(BlockIORequest));
>>> >> +    request->bs = bs;
>>> >> +    request->handler = handler;
>>> >> +    request->sector_num = sector_num;
>>> >> +    request->qiov = qiov;
>>> >> +    request->nb_sectors = nb_sectors;
>>> >> +    request->cb = cb;
>>> >> +    request->opaque = opaque;
>>> >> +
>>> >> +    QTAILQ_INSERT_TAIL(&queue->requests, request, entry);
>>> >
>>> > It would be simpler to define BlockQueueAIOCB and using it as our acb
>>> > instead of managing an extra BlockIORequest structure.  That way you
>>> > don't need to worry about extra mallocs and frees.
>>> Sorry, I don't get what you mean. How would it be defined? Can you elaborate?
>>
>> BlockDriverAIOCB is designed to be embedded inside a bigger struct.  For
>> example, QEDAIOCB is a larger struct that contains BlockDriverAIOCB as
>> its first field:
>>
>> typedef struct QEDAIOCB {
>>    BlockDriverAIOCB common;
>>    ...
>> } QEDAIOCB;
>>
>> And the QED AIOPool contains the size of QEDAIOCB so that qemu_aio_get() can
>> allocate the full QEDAIOCB struct:
>>
>> static AIOPool qed_aio_pool = {
>>    .aiocb_size         = sizeof(QEDAIOCB),
>>    .cancel             = qed_aio_cancel,
>> };
>>
>> This allows QED to store per-request state in QEDAIOCB for the lifetime of a
>> request:
>>
>> QEDAIOCB *acb = qemu_aio_get(&qed_aio_pool, bs, cb, opaque);
>>
>> acb->is_write = is_write;
>> acb->finished = NULL;
>> acb->qiov = qiov;
>> ...
>>
>> I suggest creating a BlockQueueAIOCB that contains the fields from
>> BlockIORequest (which is no longer needed as a separate struct):
>>
>> typedef struct BlockQueueAIOCB {
>>    BlockDriverAIOCB common;
>>    BlockRequestHandler *handler;
>>    int64_t sector_num;
>>    QEMUIOVector *qiov;
>>    int nb_sectors;
>>    QTAILQ_ENTRY(BlockQueueAIOCB) entry; /* pending request queue */
>> } BlockQueueAIOCB;
>>
>> Now you can drop the malloc and simply qemu_aio_get() a new
>> BlockQueueAIOCB.
> Yesterday I tried this. To be honest, I do not like this solution. :)
> It makes the code logic a bit messy and unclear: an AIOCB struct is
> treated as a block request and enqueued into the block queue, which is
> a bit weird. So I would like to retain BlockIORequest and define a
> BlockQueueAIOCB pool. This struct is used to store some useful
> info:
> struct BlockQueueAIOCB {
>    BlockDriverAIOCB common;
>    bool dispatched;
>    BlockDriverAIOCB *real_acb;
>    BlockIORequest *request;
> };
>
> What do you think of this?

Try it out and let's see the patch :).

Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH v5 3/4] block: add block timer and block throttling algorithm
  2011-08-10 11:00       ` Stefan Hajnoczi
@ 2011-08-12  5:00         ` Zhi Yong Wu
  2011-08-12  5:06           ` Stefan Hajnoczi
  0 siblings, 1 reply; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-12  5:00 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kwolf, pair, stefanha, kvm, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Wed, Aug 10, 2011 at 7:00 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Wed, Aug 10, 2011 at 7:57 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
>> On Tue, Aug 9, 2011 at 11:19 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>> On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
>>>> Note:
>>>>      1.) When bps/iops limits are specified to a small value such as 511 bytes/s, this VM will hang up. We are considering how to handle this scenario.
>>>
>>> If an I/O request is larger than the limit itself then I think we
>>> should let it through with a warning that the limit is too low.  This
>> If it only prints a warning, that seems not a nice way; the
>> throttling function will take no effect in this scenario.
>
> Setting the limit below the max request size is a misconfiguration.
> It's a problem that the user needs to be aware of and fix, so a
> warning is appropriate.  Unfortunately the max request size is not
> easy to determine from the block layer and may vary between guest
> OSes, that's why we need to do something at runtime.
>
> We cannot leave this case unhandled because it results in the guest
> timing out I/O without any information about the problem - that makes
> it hard to troubleshoot for the user.  Any other ideas on how to
> handle this case?
Can we constrain the I/O limits specified by the user? When the limit
is smaller than some value such as 1 MB/s, an error could be reported
to remind the user that the limit is too small.

>
>>>> @@ -90,6 +100,68 @@ int is_windows_drive(const char *filename)
>>>>  }
>>>>  #endif
>>>>
>>>> +/* throttling disk I/O limits */
>>>> +void bdrv_io_limits_disable(BlockDriverState *bs)
>>>> +{
>>>> +    bs->io_limits_enabled = false;
>>>> +    bs->req_from_queue    = false;
>>>> +
>>>> +    if (bs->block_queue) {
>>>> +        qemu_block_queue_flush(bs->block_queue);
>>>> +        qemu_del_block_queue(bs->block_queue);
>>>
>>> When you fix the acb lifecycle in block-queue.c this will no longer be
>>> safe.  Requests that are dispatched still have acbs that belong to
>> When the request is dispatched, if it is successfully serviced, our
>> acb will be removed.
>
> Yes, that is how the patches work right now.  But the acb lifecycle
> that I explained in response to the other patch changes this.
OK.
>
>>> this queue.  It is simplest to keep the queue for the lifetime of the
>>> BlockDriverState - it's just a list so it doesn't take up much memory.
>> I would prefer to remove this queue, even though it is simple and
>> harmless to let it exist.
>
> If the existing acbs do not touch their BlockQueue after this point in
> the code, then it will be safe to delete the queue since it will no
> longer be referenced.  If you can guarantee this then it's fine to
> keep this code.
I prefer removing the queue.
>
>>
>>>
>>>> +    }
>>>> +
>>>> +    if (bs->block_timer) {
>>>> +        qemu_del_timer(bs->block_timer);
>>>> +        qemu_free_timer(bs->block_timer);
>>>
>>> To prevent double frees:
>>>
>>> bs->block_timer = NULL;
>> Can qemu_free_timer() not ensure that it is NULL? This is not necessary.:)
>
> No, qemu_free_timer() is a function.  qemu_free_timer() cannot change
> the pointer variable bs->block_timer because in C functions take
> arguments by value.
>
> After qemu_free_timer() bs->block_timer will still hold the old
> address of the (freed) timer.  Then when bdrv_close() is called it
> does qemu_free_timer() again because bs->block_timer is not NULL.  It
> is illegal to free() a pointer twice and it can cause a crash.
Done.
>
>>
>>>
>>>> +    }
>>>> +
>>>> +    bs->slice_start[0]   = 0;
>>>> +    bs->slice_start[1]   = 0;
>>>> +
>>>> +    bs->slice_end[0]     = 0;
>>>> +    bs->slice_end[1]     = 0;
>>>> +}
>>>> +
>>>> +static void bdrv_block_timer(void *opaque)
>>>> +{
>>>> +    BlockDriverState *bs = opaque;
>>>> +    BlockQueue *queue = bs->block_queue;
>>>> +
>>>> +    qemu_block_queue_flush(queue);
>>>> +}
>>>> +
>>>> +void bdrv_io_limits_enable(BlockDriverState *bs)
>>>> +{
>>>> +    bs->req_from_queue = false;
>>>> +
>>>> +    bs->block_queue    = qemu_new_block_queue();
>>>> +    bs->block_timer    = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
>>>> +
>>>> +    bs->slice_start[BLOCK_IO_LIMIT_READ]  = qemu_get_clock_ns(vm_clock);
>>>> +    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = qemu_get_clock_ns(vm_clock);
>>>> +
>>>> +    bs->slice_end[BLOCK_IO_LIMIT_READ]    =
>>>> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>>>> +    bs->slice_end[BLOCK_IO_LIMIT_WRITE]   =
>>>> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>>>
>>> The slice times could just be initialized to 0 here.  When
>>> bdrv_exceed_io_limits() is called it will start a new slice.
>> The result should be the same as the current method even if they are set to zero.
>
> Yes, the result is the same except we only create new slices in one place.
Done.
>
>>>> +            }
>>>> +
>>>> +            return true;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    elapsed_time  = real_time - bs->slice_start[is_write];
>>>> +    elapsed_time  /= (BLOCK_IO_SLICE_TIME * 10.0);
>>>
>>> Why * 10.0?
>> elapsed_time currently is in ns, and it needs to be translated into a
>> floating value in minutes.
>
> I think you mean seconds instead of minutes.  Then the conversion has
> nothing to do with BLOCK_IO_SLICE_TIME, it's just nanoseconds to
> seconds.  It would be clearer to drop BLOCK_IO_SLICE_TIME from the
> expression and divide by a new NANOSECONDS_PER_SECOND constant
> (1000000000.0).
Done
>
>>>
>>>> +
>>>> +    bps_ret  = bdrv_exceed_bps_limits(bs, nb_sectors,
>>>> +                                      is_write, elapsed_time, &bps_wait);
>>>> +    iops_ret = bdrv_exceed_iops_limits(bs, is_write,
>>>> +                                      elapsed_time, &iops_wait);
>>>> +    if (bps_ret || iops_ret) {
>>>> +        max_wait = bps_wait > iops_wait ? bps_wait : iops_wait;
>>>> +        if (wait) {
>>>> +            *wait = max_wait;
>>>> +        }
>>>> +
>>>> +        real_time = qemu_get_clock_ns(vm_clock);
>>>> +        if (bs->slice_end[is_write] < real_time + max_wait) {
>>>> +            bs->slice_end[is_write] = real_time + max_wait;
>>>> +        }
>>>
>>> Why is this necessary?  We have already extended the slice at the top
>>> of the function.
>> Here it will adjust the end of the slice based on
>> bdrv_exceed_bps_limits() / bdrv_exceed_iops_limits().
>> This will be more accurate.
>
> After rereading it I realized that this extends the slice up to the
> estimated dispatch time (max_wait) and not just by
> BLOCK_IO_SLICE_TIME.  This code now makes sense to me: we're trying to
> keep slices around while there are queued requests so that
> iops/throughput history is not forgotten when queued requests are
> dispatched.
Right.
>
>>>
>>>> +
>>>> +        return true;
>>>> +    }
>>>> +
>>>> +    if (wait) {
>>>> +        *wait = 0;
>>>> +    }
>>>> +
>>>> +    return false;
>>>> +}
>>>>
>>>>  /**************************************************************/
>>>>  /* async I/Os */
>>>> @@ -2121,13 +2398,28 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
>>>>  {
>>>>     BlockDriver *drv = bs->drv;
>>>>     BlockDriverAIOCB *ret;
>>>> +    uint64_t wait_time = 0;
>>>>
>>>>     trace_bdrv_aio_readv(bs, sector_num, nb_sectors, opaque);
>>>>
>>>> -    if (!drv)
>>>> -        return NULL;
>>>> -    if (bdrv_check_request(bs, sector_num, nb_sectors))
>>>> +    if (!drv || bdrv_check_request(bs, sector_num, nb_sectors)) {
>>>> +        if (bs->io_limits_enabled) {
>>>> +            bs->req_from_queue = false;
>>>> +        }
>>>>         return NULL;
>>>> +    }
>>>> +
>>>> +    /* throttling disk read I/O */
>>>> +    if (bs->io_limits_enabled) {
>>>> +        if (bdrv_exceed_io_limits(bs, nb_sectors, false, &wait_time)) {
>>>> +            ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_readv,
>>>> +                              sector_num, qiov, nb_sectors, cb, opaque);
>>>> +            qemu_mod_timer(bs->block_timer,
>>>> +                       wait_time + qemu_get_clock_ns(vm_clock));
>>>
>>> If a guest keeps sending I/O requests, say every millisecond, then we
>>> will delay dispatch forever.  In practice we hope the storage
>>> interface (virtio-blk, LSI SCSI, etc) has a resource limit that
>>> prevents this.  But the point remains that this delays requests that
>>> should be dispatched for too long.  Once the timer has been set we
>>> should not set it again.
>> Can you elaborate on this?
>
> Let's wait until the next version of the patch, I've suggested many
> changes and should comment against current code instead of the way I
> think it is going to work :).
OK.
>
> Stefan
>



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH v5 3/4] block: add block timer and block throttling algorithm
  2011-08-12  5:00         ` Zhi Yong Wu
@ 2011-08-12  5:06           ` Stefan Hajnoczi
  2011-08-12  5:23             ` Zhi Yong Wu
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-08-12  5:06 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: kwolf, pair, stefanha, kvm, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Fri, Aug 12, 2011 at 6:00 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
> On Wed, Aug 10, 2011 at 7:00 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> On Wed, Aug 10, 2011 at 7:57 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
>>> On Tue, Aug 9, 2011 at 11:19 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>> On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
>>>>> Note:
>>>>>      1.) When bps/iops limits are specified to a small value such as 511 bytes/s, this VM will hang up. We are considering how to handle this scenario.
>>>>
>>>> If an I/O request is larger than the limit itself then I think we
>>>> should let it through with a warning that the limit is too low.  This
>>> If it only prints a warning, that seems not a nice way; the
>>> throttling function will take no effect in this scenario.
>>
>> Setting the limit below the max request size is a misconfiguration.
>> It's a problem that the user needs to be aware of and fix, so a
>> warning is appropriate.  Unfortunately the max request size is not
>> easy to determine from the block layer and may vary between guest
>> OSes, that's why we need to do something at runtime.
>>
>> We cannot leave this case unhandled because it results in the guest
>> timing out I/O without any information about the problem - that makes
>> it hard to troubleshoot for the user.  Any other ideas on how to
>> handle this case?
> Can we constrain the I/O limits specified by the user? When the limit
> is smaller than some value such as 1 MB/s, an error could be reported
> to remind the user that the limit is too small.

It would be an arbitrary limit because the maximum request size
depends on the storage interface or on the host block device.  Guest
OSes may limit the max request size further.  There is no single
constant that we can choose ahead of time.  Any constant value risks
not allowing the user to set a small valid limit or being less than
the max request size and therefore causing I/O to stall again.

Stefan

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH v5 3/4] block: add block timer and block throttling algorithm
  2011-08-12  5:06           ` Stefan Hajnoczi
@ 2011-08-12  5:23             ` Zhi Yong Wu
  0 siblings, 0 replies; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-12  5:23 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kwolf, pair, stefanha, kvm, mtosatti, qemu-devel, ryanh,
	Zhi Yong Wu, luowenj

On Fri, Aug 12, 2011 at 1:06 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Fri, Aug 12, 2011 at 6:00 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
>> On Wed, Aug 10, 2011 at 7:00 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>> On Wed, Aug 10, 2011 at 7:57 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
>>>> On Tue, Aug 9, 2011 at 11:19 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>>>> On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
>>>>>> Note:
>>>>>>      1.) When bps/iops limits are specified to a small value such as 511 bytes/s, this VM will hang up. We are considering how to handle this scenario.
>>>>>
>>>>> If an I/O request is larger than the limit itself then I think we
>>>>> should let it through with a warning that the limit is too low.  This
>>>> If it only prints a warning, that seems not a nice way; the
>>>> throttling function will take no effect in this scenario.
>>>
>>> Setting the limit below the max request size is a misconfiguration.
>>> It's a problem that the user needs to be aware of and fix, so a
>>> warning is appropriate.  Unfortunately the max request size is not
>>> easy to determine from the block layer and may vary between guest
>>> OSes, that's why we need to do something at runtime.
>>>
>>> We cannot leave this case unhandled because it results in the guest
>>> timing out I/O without any information about the problem - that makes
>>> it hard to troubleshoot for the user.  Any other ideas on how to
>>> handle this case?
>> Can we constrain the I/O limits specified by the user? When the limit
>> is smaller than some value such as 1 MB/s, an error could be reported
>> to remind the user that the limit is too small.
>
> It would be an arbitrary limit because the maximum request size
> depends on the storage interface or on the host block device.  Guest
> OSes may limit the max request size further.  There is no single
> constant that we can choose ahead of time.  Any constant value risks
> not allowing the user to set a small valid limit or being less than
> the max request size and therefore causing I/O to stall again.
Let us think about it first. When either of us has a better idea, we
will discuss this again. :)
>
> Stefan
>



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [Qemu-devel] [PATCH v5 3/4] block: add block timer and block throttling algorithm
  2011-08-09  8:57   ` Ram Pai
  2011-08-09  9:06     ` Zhi Yong Wu
@ 2011-08-12  5:35     ` Zhi Yong Wu
  2011-08-12  5:47       ` Stefan Hajnoczi
  1 sibling, 1 reply; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-12  5:35 UTC (permalink / raw)
  To: Ram Pai
  Cc: kwolf, stefanha, kvm, mtosatti, qemu-devel, ryanh, Zhi Yong Wu,
	luowenj

On Tue, Aug 9, 2011 at 4:57 PM, Ram Pai <linuxram@us.ibm.com> wrote:
> On Tue, Aug 09, 2011 at 12:17:51PM +0800, Zhi Yong Wu wrote:
>> Note:
>>       1.) When bps/iops limits are specified to a small value such as 511 bytes/s, this VM will hang up. We are considering how to handle this scenario.
>>       2.) When the "dd" command is issued in the guest, if its option bs is set to a large value such as "bs=1024K", the resulting speed will be slightly bigger than the limits.
>>
>> For these problems, if you have nice thought, pls let us know.:)
>>
>> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>> ---
>>  block.c     |  347 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>  block.h     |    6 +-
>>  block_int.h |   30 +++++
>>  3 files changed, 372 insertions(+), 11 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index 24a25d5..8fd6643 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -29,6 +29,9 @@
>>  #include "module.h"
>>  #include "qemu-objects.h"
>>
>> +#include "qemu-timer.h"
>> +#include "block/blk-queue.h"
>> +
>>  #ifdef CONFIG_BSD
>>  #include <sys/types.h>
>>  #include <sys/stat.h>
>> @@ -58,6 +61,13 @@ static int bdrv_read_em(BlockDriverState *bs, int64_t sector_num,
>>  static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num,
>>                           const uint8_t *buf, int nb_sectors);
>>
>> +static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
>> +        bool is_write, double elapsed_time, uint64_t *wait);
>> +static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
>> +        double elapsed_time, uint64_t *wait);
>> +static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
>> +        bool is_write, uint64_t *wait);
>> +
>>  static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
>>      QTAILQ_HEAD_INITIALIZER(bdrv_states);
>>
>> @@ -90,6 +100,68 @@ int is_windows_drive(const char *filename)
>>  }
>>  #endif
>>
>> +/* throttling disk I/O limits */
>> +void bdrv_io_limits_disable(BlockDriverState *bs)
>> +{
>> +    bs->io_limits_enabled = false;
>> +    bs->req_from_queue    = false;
>> +
>> +    if (bs->block_queue) {
>> +        qemu_block_queue_flush(bs->block_queue);
>> +        qemu_del_block_queue(bs->block_queue);
>> +    }
>> +
>> +    if (bs->block_timer) {
>> +        qemu_del_timer(bs->block_timer);
>> +        qemu_free_timer(bs->block_timer);
>> +    }
>> +
>> +    bs->slice_start[0]   = 0;
>> +    bs->slice_start[1]   = 0;
>> +
>> +    bs->slice_end[0]     = 0;
>> +    bs->slice_end[1]     = 0;
>> +}
>> +
>> +static void bdrv_block_timer(void *opaque)
>> +{
>> +    BlockDriverState *bs = opaque;
>> +    BlockQueue *queue = bs->block_queue;
>> +
>> +    qemu_block_queue_flush(queue);
>> +}
>> +
>> +void bdrv_io_limits_enable(BlockDriverState *bs)
>> +{
>> +    bs->req_from_queue = false;
>> +
>> +    bs->block_queue    = qemu_new_block_queue();
>> +    bs->block_timer    = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
>> +
>> +    bs->slice_start[BLOCK_IO_LIMIT_READ]  = qemu_get_clock_ns(vm_clock);
>> +    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = qemu_get_clock_ns(vm_clock);
>
> A minor comment: better to keep the slice_start of both the READ and WRITE
> sides the same.
>
>    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = bs->slice_start[BLOCK_IO_LIMIT_READ];
>
> saves  a call to qemu_get_clock_ns().
>
>> +
>> +    bs->slice_end[BLOCK_IO_LIMIT_READ]    =
>> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>
>    bs->slice_end[BLOCK_IO_LIMIT_READ] = bs->slice_start[BLOCK_IO_LIMIT_READ] +
>                        BLOCK_IO_SLICE_TIME;
>
> saves one more call to qemu_get_clock_ns()
>
>> +    bs->slice_end[BLOCK_IO_LIMIT_WRITE]   =
>> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>
>
>    bs->slice_end[BLOCK_IO_LIMIT_WRITE] = bs->slice_start[BLOCK_IO_LIMIT_WRITE] +
>                        BLOCK_IO_SLICE_TIME;
>
> yet another call saved.
>
>
>> +}
>> +
>> +bool bdrv_io_limits_enabled(BlockDriverState *bs)
>> +{
>> +    BlockIOLimit *io_limits = &bs->io_limits;
>> +    if ((io_limits->bps[BLOCK_IO_LIMIT_READ] == 0)
>> +         && (io_limits->bps[BLOCK_IO_LIMIT_WRITE] == 0)
>> +         && (io_limits->bps[BLOCK_IO_LIMIT_TOTAL] == 0)
>> +         && (io_limits->iops[BLOCK_IO_LIMIT_READ] == 0)
>> +         && (io_limits->iops[BLOCK_IO_LIMIT_WRITE] == 0)
>> +         && (io_limits->iops[BLOCK_IO_LIMIT_TOTAL] == 0)) {
>> +        return false;
>> +    }
>> +
>> +    return true;
>> +}
>
> can be optimized to:
>
>        return (io_limits->bps[BLOCK_IO_LIMIT_READ]
>                || io_limits->bps[BLOCK_IO_LIMIT_WRITE]
>                || io_limits->bps[BLOCK_IO_LIMIT_TOTAL]
>                || io_limits->iops[BLOCK_IO_LIMIT_READ]
>                || io_limits->iops[BLOCK_IO_LIMIT_WRITE]
>                || io_limits->iops[BLOCK_IO_LIMIT_TOTAL]);
I want to apply this, but it violates the QEMU coding style.
>
>> +
>>  /* check if the path starts with "<protocol>:" */
>>  static int path_has_protocol(const char *path)
>>  {
>> @@ -642,6 +714,11 @@ int bdrv_open(BlockDriverState *bs, const char *filename, int flags,
>>              bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
>>      }
>>
>> +    /* throttling disk I/O limits */
>> +    if (bs->io_limits_enabled) {
>> +        bdrv_io_limits_enable(bs);
>> +    }
>> +
>>      return 0;
>>
>>  unlink_and_fail:
>> @@ -680,6 +757,16 @@ void bdrv_close(BlockDriverState *bs)
>>          if (bs->change_cb)
>>              bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
>>      }
>> +
>> +    /* throttling disk I/O limits */
>> +    if (bs->block_queue) {
>> +        qemu_del_block_queue(bs->block_queue);
>> +    }
>> +
>> +    if (bs->block_timer) {
>> +        qemu_del_timer(bs->block_timer);
>> +        qemu_free_timer(bs->block_timer);
>> +    }
>>  }
>>
>>  void bdrv_close_all(void)
>> @@ -1312,6 +1399,15 @@ void bdrv_get_geometry_hint(BlockDriverState *bs,
>>      *psecs = bs->secs;
>>  }
>>
>> +/* throttling disk io limits */
>> +void bdrv_set_io_limits(BlockDriverState *bs,
>> +                            BlockIOLimit *io_limits)
>> +{
>> +    memset(&bs->io_limits, 0, sizeof(BlockIOLimit));
>> +    bs->io_limits = *io_limits;
>> +    bs->io_limits_enabled = bdrv_io_limits_enabled(bs);
>> +}
>> +
>>  /* Recognize floppy formats */
>>  typedef struct FDFormat {
>>      FDriveType drive;
>> @@ -1707,6 +1803,16 @@ static void bdrv_print_dict(QObject *obj, void *opaque)
>>                              qdict_get_bool(qdict, "ro"),
>>                              qdict_get_str(qdict, "drv"),
>>                              qdict_get_bool(qdict, "encrypted"));
>> +
>> +        monitor_printf(mon, " bps=%" PRId64 " bps_rd=%" PRId64
>> +                            " bps_wr=%" PRId64 " iops=%" PRId64
>> +                            " iops_rd=%" PRId64 " iops_wr=%" PRId64,
>> +                            qdict_get_int(qdict, "bps"),
>> +                            qdict_get_int(qdict, "bps_rd"),
>> +                            qdict_get_int(qdict, "bps_wr"),
>> +                            qdict_get_int(qdict, "iops"),
>> +                            qdict_get_int(qdict, "iops_rd"),
>> +                            qdict_get_int(qdict, "iops_wr"));
>>      } else {
>>          monitor_printf(mon, " [not inserted]");
>>      }
>> @@ -1739,10 +1845,22 @@ void bdrv_info(Monitor *mon, QObject **ret_data)
>>              QDict *bs_dict = qobject_to_qdict(bs_obj);
>>
>>              obj = qobject_from_jsonf("{ 'file': %s, 'ro': %i, 'drv': %s, "
>> -                                     "'encrypted': %i }",
>> +                                     "'encrypted': %i, "
>> +                                     "'bps': %" PRId64 ","
>> +                                     "'bps_rd': %" PRId64 ","
>> +                                     "'bps_wr': %" PRId64 ","
>> +                                     "'iops': %" PRId64 ","
>> +                                     "'iops_rd': %" PRId64 ","
>> +                                     "'iops_wr': %" PRId64 "}",
>>                                       bs->filename, bs->read_only,
>>                                       bs->drv->format_name,
>> -                                     bdrv_is_encrypted(bs));
>> +                                     bdrv_is_encrypted(bs),
>> +                                     bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL],
>> +                                     bs->io_limits.bps[BLOCK_IO_LIMIT_READ],
>> +                                     bs->io_limits.bps[BLOCK_IO_LIMIT_WRITE],
>> +                                     bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL],
>> +                                     bs->io_limits.iops[BLOCK_IO_LIMIT_READ],
>> +                                     bs->io_limits.iops[BLOCK_IO_LIMIT_WRITE]);
>>              if (bs->backing_file[0] != '\0') {
>>                  QDict *qdict = qobject_to_qdict(obj);
>>                  qdict_put(qdict, "backing_file",
>> @@ -2111,6 +2229,165 @@ char *bdrv_snapshot_dump(char *buf, int buf_size, QEMUSnapshotInfo *sn)
>>      return buf;
>>  }
>>
>> +static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
>> +                 bool is_write, double elapsed_time, uint64_t *wait) {
>> +    uint64_t bps_limit = 0;
>> +    double   bytes_limit, bytes_disp, bytes_res;
>> +    double   slice_time, wait_time;
>> +
>> +    if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
>> +        bps_limit = bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL];
>> +    } else if (bs->io_limits.bps[is_write]) {
>> +        bps_limit = bs->io_limits.bps[is_write];
>> +    } else {
>> +        if (wait) {
>> +            *wait = 0;
>> +        }
>> +
>> +        return false;
>> +    }
>> +
>> +    slice_time = bs->slice_end[is_write] - bs->slice_start[is_write];
>> +    slice_time /= (BLOCK_IO_SLICE_TIME * 10.0);
>> +    bytes_limit = bps_limit * slice_time;
>> +    bytes_disp  = bs->io_disps.bytes[is_write];
>> +    if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
>> +        bytes_disp += bs->io_disps.bytes[!is_write];
>> +    }
>> +
>> +    bytes_res   = (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
>> +
>> +    if (bytes_disp + bytes_res <= bytes_limit) {
>> +        if (wait) {
>> +            *wait = 0;
>> +        }
>> +
>> +        return false;
>> +    }
>> +
>> +    /* Calc approx time to dispatch */
>> +    wait_time = (bytes_disp + bytes_res) / bps_limit - elapsed_time;
>> +
>> +    if (wait) {
>> +        *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
>> +                             double elapsed_time, uint64_t *wait) {
>> +    uint64_t iops_limit = 0;
>> +    double   ios_limit, ios_disp;
>> +    double   slice_time, wait_time;
>> +
>> +    if (bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
>> +        iops_limit = bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL];
>> +    } else if (bs->io_limits.iops[is_write]) {
>> +        iops_limit = bs->io_limits.iops[is_write];
>> +    } else {
>> +        if (wait) {
>> +            *wait = 0;
>> +        }
>> +
>> +        return false;
>> +    }
>> +
>> +    slice_time = bs->slice_end[is_write] - bs->slice_start[is_write];
>> +    slice_time /= (BLOCK_IO_SLICE_TIME * 10.0);
>> +    ios_limit  = iops_limit * slice_time;
>> +    ios_disp   = bs->io_disps.ios[is_write];
>> +    if (bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
>> +        ios_disp += bs->io_disps.ios[!is_write];
>> +    }
>> +
>> +    if (ios_disp + 1 <= ios_limit) {
>> +        if (wait) {
>> +            *wait = 0;
>> +        }
>> +
>> +        return false;
>> +    }
>> +
>> +    /* Calc approx time to dispatch */
>> +    wait_time = (ios_disp + 1) / iops_limit;
>> +    if (wait_time > elapsed_time) {
>> +        wait_time = wait_time - elapsed_time;
>> +    } else {
>> +        wait_time = 0;
>> +    }
>> +
>> +    if (wait) {
>> +        *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
>> +    }
>> +
>> +    return true;
>> +}
>
> bdrv_exceed_bps_limits() and bdrv_exceed_iops_limits() have very similar logic.
> They could probably be abstracted into a single function.
>
> RP
>
>



-- 
Regards,

Zhi Yong Wu

^ permalink raw reply	[flat|nested] 36+ messages in thread
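The two checks RP compares in the message above really do differ only in the unit being budgeted (bytes for bps, I/O counts for iops). A unified helper could look like the sketch below; the names and types are illustrative stand-ins, not the actual QEMU definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical merged form of bdrv_exceed_bps_limits()/bdrv_exceed_iops_limits().
 * "units" means bytes in the bps case and I/Os in the iops case; everything
 * else, including the wait-time estimate, is shared logic. */
static bool exceed_limit(double limit_per_sec, double units_dispatched,
                         double units_requested, double slice_time,
                         double elapsed_time, double *wait)
{
    double budget;

    if (limit_per_sec <= 0) {          /* no limit configured for this case */
        *wait = 0;
        return false;
    }

    budget = limit_per_sec * slice_time;   /* units allowed in this slice */
    if (units_dispatched + units_requested <= budget) {
        *wait = 0;                         /* request fits, dispatch now */
        return false;
    }

    /* Approximate time until the request would fit under the limit. */
    *wait = (units_dispatched + units_requested) / limit_per_sec - elapsed_time;
    if (*wait < 0) {
        *wait = 0;
    }
    return true;
}
```

A bps caller would pass bytes dispatched/requested; an iops caller would pass I/O counts with units_requested = 1. The slice-time and nanosecond scaling done in the patch would stay at the call sites.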

* Re: [Qemu-devel] [PATCH v5 3/4] block: add block timer and block throttling algorithm
  2011-08-12  5:35     ` Zhi Yong Wu
@ 2011-08-12  5:47       ` Stefan Hajnoczi
  2011-08-12  6:00         ` Zhi Yong Wu
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-08-12  5:47 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: kwolf, stefanha, kvm, mtosatti, Ram Pai, qemu-devel, ryanh,
	luowenj, Zhi Yong Wu

On Fri, Aug 12, 2011 at 6:35 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
> On Tue, Aug 9, 2011 at 4:57 PM, Ram Pai <linuxram@us.ibm.com> wrote:
>> On Tue, Aug 09, 2011 at 12:17:51PM +0800, Zhi Yong Wu wrote:
>>> Note:
>>>       1.) When bps/iops limits are set to a small value such as 511 bytes/s, the VM will hang. We are considering how to handle this scenario.
>>>       2.) When the "dd" command is issued in the guest with a large block size such as "bs=1024K", the resulting speed will be slightly higher than the limit.
>>>
>>> For these problems, if you have any good ideas, please let us know. :)
>>>
>>> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>> ---
>>>  block.c     |  347 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>>  block.h     |    6 +-
>>>  block_int.h |   30 +++++
>>>  3 files changed, 372 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/block.c b/block.c
>>> index 24a25d5..8fd6643 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -29,6 +29,9 @@
>>>  #include "module.h"
>>>  #include "qemu-objects.h"
>>>
>>> +#include "qemu-timer.h"
>>> +#include "block/blk-queue.h"
>>> +
>>>  #ifdef CONFIG_BSD
>>>  #include <sys/types.h>
>>>  #include <sys/stat.h>
>>> @@ -58,6 +61,13 @@ static int bdrv_read_em(BlockDriverState *bs, int64_t sector_num,
>>>  static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num,
>>>                           const uint8_t *buf, int nb_sectors);
>>>
>>> +static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
>>> +        bool is_write, double elapsed_time, uint64_t *wait);
>>> +static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
>>> +        double elapsed_time, uint64_t *wait);
>>> +static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
>>> +        bool is_write, uint64_t *wait);
>>> +
>>>  static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
>>>      QTAILQ_HEAD_INITIALIZER(bdrv_states);
>>>
>>> @@ -90,6 +100,68 @@ int is_windows_drive(const char *filename)
>>>  }
>>>  #endif
>>>
>>> +/* throttling disk I/O limits */
>>> +void bdrv_io_limits_disable(BlockDriverState *bs)
>>> +{
>>> +    bs->io_limits_enabled = false;
>>> +    bs->req_from_queue    = false;
>>> +
>>> +    if (bs->block_queue) {
>>> +        qemu_block_queue_flush(bs->block_queue);
>>> +        qemu_del_block_queue(bs->block_queue);
>>> +    }
>>> +
>>> +    if (bs->block_timer) {
>>> +        qemu_del_timer(bs->block_timer);
>>> +        qemu_free_timer(bs->block_timer);
>>> +    }
>>> +
>>> +    bs->slice_start[0]   = 0;
>>> +    bs->slice_start[1]   = 0;
>>> +
>>> +    bs->slice_end[0]     = 0;
>>> +    bs->slice_end[1]     = 0;
>>> +}
>>> +
>>> +static void bdrv_block_timer(void *opaque)
>>> +{
>>> +    BlockDriverState *bs = opaque;
>>> +    BlockQueue *queue = bs->block_queue;
>>> +
>>> +    qemu_block_queue_flush(queue);
>>> +}
>>> +
>>> +void bdrv_io_limits_enable(BlockDriverState *bs)
>>> +{
>>> +    bs->req_from_queue = false;
>>> +
>>> +    bs->block_queue    = qemu_new_block_queue();
>>> +    bs->block_timer    = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
>>> +
>>> +    bs->slice_start[BLOCK_IO_LIMIT_READ]  = qemu_get_clock_ns(vm_clock);
>>> +    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = qemu_get_clock_ns(vm_clock);
>>
>> A minor comment: it is better to keep the slice_start of both the READ and
>> WRITE sides the same.
>>
>>    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = bs->slice_start[BLOCK_IO_LIMIT_READ];
>>
>> saves a call to qemu_get_clock_ns().
>>
>>> +
>>> +    bs->slice_end[BLOCK_IO_LIMIT_READ]    =
>>> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>>
>>    bs->slice_end[BLOCK_IO_LIMIT_READ] = bs->slice_start[BLOCK_IO_LIMIT_READ] +
>>                        BLOCK_IO_SLICE_TIME;
>>
>> saves one more call to qemu_get_clock_ns()
>>
>>> +    bs->slice_end[BLOCK_IO_LIMIT_WRITE]   =
>>> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>>
>>
>>    bs->slice_end[BLOCK_IO_LIMIT_WRITE] = bs->slice_start[BLOCK_IO_LIMIT_WRITE] +
>>                        BLOCK_IO_SLICE_TIME;
>>
>> yet another call saving.
>>
>>
>>> +}
>>> +
>>> +bool bdrv_io_limits_enabled(BlockDriverState *bs)
>>> +{
>>> +    BlockIOLimit *io_limits = &bs->io_limits;
>>> +    if ((io_limits->bps[BLOCK_IO_LIMIT_READ] == 0)
>>> +         && (io_limits->bps[BLOCK_IO_LIMIT_WRITE] == 0)
>>> +         && (io_limits->bps[BLOCK_IO_LIMIT_TOTAL] == 0)
>>> +         && (io_limits->iops[BLOCK_IO_LIMIT_READ] == 0)
>>> +         && (io_limits->iops[BLOCK_IO_LIMIT_WRITE] == 0)
>>> +         && (io_limits->iops[BLOCK_IO_LIMIT_TOTAL] == 0)) {
>>> +        return false;
>>> +    }
>>> +
>>> +    return true;
>>> +}
>>
>> can be optimized to:
>>
>>        return (io_limits->bps[BLOCK_IO_LIMIT_READ]
>>                || io_limits->bps[BLOCK_IO_LIMIT_WRITE]
>>                || io_limits->bps[BLOCK_IO_LIMIT_TOTAL]
>>                || io_limits->iops[BLOCK_IO_LIMIT_READ]
>>                || io_limits->iops[BLOCK_IO_LIMIT_WRITE]
>>                || io_limits->iops[BLOCK_IO_LIMIT_TOTAL]);
> I want to apply this, but it violates the QEMU coding style.

Perhaps checkpatch.pl complains because of the (...) around the return
value.  Try removing them.

Stefan


* Re: [Qemu-devel] [PATCH v5 3/4] block: add block timer and block throttling algorithm
  2011-08-12  5:47       ` Stefan Hajnoczi
@ 2011-08-12  6:00         ` Zhi Yong Wu
  0 siblings, 0 replies; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-12  6:00 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kwolf, stefanha, kvm, mtosatti, Ram Pai, qemu-devel, ryanh,
	luowenj, Zhi Yong Wu

On Fri, Aug 12, 2011 at 1:47 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Fri, Aug 12, 2011 at 6:35 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
>> On Tue, Aug 9, 2011 at 4:57 PM, Ram Pai <linuxram@us.ibm.com> wrote:
>>> On Tue, Aug 09, 2011 at 12:17:51PM +0800, Zhi Yong Wu wrote:
>>>> Note:
>>>>       1.) When bps/iops limits are set to a small value such as 511 bytes/s, the VM will hang. We are considering how to handle this scenario.
>>>>       2.) When the "dd" command is issued in the guest with a large block size such as "bs=1024K", the resulting speed will be slightly higher than the limit.
>>>>
>>>> For these problems, if you have any good ideas, please let us know. :)
>>>>
>>>> Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
>>>> ---
>>>>  block.c     |  347 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>>>  block.h     |    6 +-
>>>>  block_int.h |   30 +++++
>>>>  3 files changed, 372 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/block.c b/block.c
>>>> index 24a25d5..8fd6643 100644
>>>> --- a/block.c
>>>> +++ b/block.c
>>>> @@ -29,6 +29,9 @@
>>>>  #include "module.h"
>>>>  #include "qemu-objects.h"
>>>>
>>>> +#include "qemu-timer.h"
>>>> +#include "block/blk-queue.h"
>>>> +
>>>>  #ifdef CONFIG_BSD
>>>>  #include <sys/types.h>
>>>>  #include <sys/stat.h>
>>>> @@ -58,6 +61,13 @@ static int bdrv_read_em(BlockDriverState *bs, int64_t sector_num,
>>>>  static int bdrv_write_em(BlockDriverState *bs, int64_t sector_num,
>>>>                           const uint8_t *buf, int nb_sectors);
>>>>
>>>> +static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
>>>> +        bool is_write, double elapsed_time, uint64_t *wait);
>>>> +static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
>>>> +        double elapsed_time, uint64_t *wait);
>>>> +static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
>>>> +        bool is_write, uint64_t *wait);
>>>> +
>>>>  static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
>>>>      QTAILQ_HEAD_INITIALIZER(bdrv_states);
>>>>
>>>> @@ -90,6 +100,68 @@ int is_windows_drive(const char *filename)
>>>>  }
>>>>  #endif
>>>>
>>>> +/* throttling disk I/O limits */
>>>> +void bdrv_io_limits_disable(BlockDriverState *bs)
>>>> +{
>>>> +    bs->io_limits_enabled = false;
>>>> +    bs->req_from_queue    = false;
>>>> +
>>>> +    if (bs->block_queue) {
>>>> +        qemu_block_queue_flush(bs->block_queue);
>>>> +        qemu_del_block_queue(bs->block_queue);
>>>> +    }
>>>> +
>>>> +    if (bs->block_timer) {
>>>> +        qemu_del_timer(bs->block_timer);
>>>> +        qemu_free_timer(bs->block_timer);
>>>> +    }
>>>> +
>>>> +    bs->slice_start[0]   = 0;
>>>> +    bs->slice_start[1]   = 0;
>>>> +
>>>> +    bs->slice_end[0]     = 0;
>>>> +    bs->slice_end[1]     = 0;
>>>> +}
>>>> +
>>>> +static void bdrv_block_timer(void *opaque)
>>>> +{
>>>> +    BlockDriverState *bs = opaque;
>>>> +    BlockQueue *queue = bs->block_queue;
>>>> +
>>>> +    qemu_block_queue_flush(queue);
>>>> +}
>>>> +
>>>> +void bdrv_io_limits_enable(BlockDriverState *bs)
>>>> +{
>>>> +    bs->req_from_queue = false;
>>>> +
>>>> +    bs->block_queue    = qemu_new_block_queue();
>>>> +    bs->block_timer    = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
>>>> +
>>>> +    bs->slice_start[BLOCK_IO_LIMIT_READ]  = qemu_get_clock_ns(vm_clock);
>>>> +    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = qemu_get_clock_ns(vm_clock);
>>>
>>> A minor comment: it is better to keep the slice_start of both the READ and
>>> WRITE sides the same.
>>>
>>>    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = bs->slice_start[BLOCK_IO_LIMIT_READ];
>>>
>>> saves a call to qemu_get_clock_ns().
>>>
>>>> +
>>>> +    bs->slice_end[BLOCK_IO_LIMIT_READ]    =
>>>> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>>>
>>>    bs->slice_end[BLOCK_IO_LIMIT_READ] = bs->slice_start[BLOCK_IO_LIMIT_READ] +
>>>                        BLOCK_IO_SLICE_TIME;
>>>
>>> saves one more call to qemu_get_clock_ns()
>>>
>>>> +    bs->slice_end[BLOCK_IO_LIMIT_WRITE]   =
>>>> +                      qemu_get_clock_ns(vm_clock) + BLOCK_IO_SLICE_TIME;
>>>
>>>
>>>    bs->slice_end[BLOCK_IO_LIMIT_WRITE] = bs->slice_start[BLOCK_IO_LIMIT_WRITE] +
>>>                        BLOCK_IO_SLICE_TIME;
>>>
>>> yet another call saving.
>>>
>>>
>>>> +}
>>>> +
>>>> +bool bdrv_io_limits_enabled(BlockDriverState *bs)
>>>> +{
>>>> +    BlockIOLimit *io_limits = &bs->io_limits;
>>>> +    if ((io_limits->bps[BLOCK_IO_LIMIT_READ] == 0)
>>>> +         && (io_limits->bps[BLOCK_IO_LIMIT_WRITE] == 0)
>>>> +         && (io_limits->bps[BLOCK_IO_LIMIT_TOTAL] == 0)
>>>> +         && (io_limits->iops[BLOCK_IO_LIMIT_READ] == 0)
>>>> +         && (io_limits->iops[BLOCK_IO_LIMIT_WRITE] == 0)
>>>> +         && (io_limits->iops[BLOCK_IO_LIMIT_TOTAL] == 0)) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    return true;
>>>> +}
>>>
>>> can be optimized to:
>>>
>>>        return (io_limits->bps[BLOCK_IO_LIMIT_READ]
>>>                || io_limits->bps[BLOCK_IO_LIMIT_WRITE]
>>>                || io_limits->bps[BLOCK_IO_LIMIT_TOTAL]
>>>                || io_limits->iops[BLOCK_IO_LIMIT_READ]
>>>                || io_limits->iops[BLOCK_IO_LIMIT_WRITE]
>>>                || io_limits->iops[BLOCK_IO_LIMIT_TOTAL]);
>> I want to apply this, but it violates the QEMU coding style.
>
> Perhaps checkpatch.pl complains because of the (...) around the return
> value.  Try removing them.
After I removed the parentheses, it works now. Thanks.
>
> Stefan
>



-- 
Regards,

Zhi Yong Wu


* Re: [Qemu-devel] [PATCH v5 2/4] block: add the block queue support
  2011-08-09 12:49   ` Stefan Hajnoczi
  2011-08-10  5:54     ` Zhi Yong Wu
@ 2011-08-12  8:10     ` Zhi Yong Wu
  2011-08-12  8:42       ` Stefan Hajnoczi
  1 sibling, 1 reply; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-12  8:10 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kwolf, stefanha, kvm, mtosatti, qemu-devel, ryanh, Zhi Yong Wu

On Tue, Aug 9, 2011 at 8:49 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
>> +/* The APIs for block request queue on qemu block layer.
>> + */
>> +
>> +static void qemu_block_queue_cancel(BlockDriverAIOCB *acb)
>> +{
>> +    qemu_aio_release(acb);
>> +}
>> +
>> +static AIOPool block_queue_pool = {
>> +    .aiocb_size         = sizeof(struct BlockDriverAIOCB),
>> +    .cancel             = qemu_block_queue_cancel,
>> +};
>
> The lifecycle of block_queue_pool acbs should be like this:
>
> When a request is queued we need a BlockQueue acb because we have no
> real acb yet.  So we get an acb from block_queue_pool.
>
> If the acb is cancelled before being dispatched we need to release the
> acb and remove the request from the queue.  (This patch does not
> remove the request from the queue on cancel.)
>
> When the acb is dispatched we need to keep track of the real acb (e.g.
> from qcow2).  The caller will only ever see our acb because there is
> no way to give them the pointer to the new acb after dispatch.  That
> means we need to keep our acb alive for the entire lifetime of the
> request.  (This patch currently releases our acb when the request is
> dispatched.)
>
> If the acb is cancelled after being dispatched we need to first cancel
> the real acb and then release our acb.
>
> When the acb is dispatched we need to pass qemu_block_queue_callback
> as the cb function to the handler.  Inside qemu_block_queue_callback we
When one block request has been re-enqueued multiple times, it will be
difficult to store the original cb function.

> invoke the request cb and then release our acb.  This is how our acb
> should be freed.
>
>> +BlockDriverAIOCB *qemu_block_queue_enqueue(BlockQueue *queue,
>> +                        BlockDriverState *bs,
>> +                        BlockRequestHandler *handler,
>> +                        int64_t sector_num,
>> +                        QEMUIOVector *qiov,
>> +                        int nb_sectors,
>> +                        BlockDriverCompletionFunc *cb,
>> +                        void *opaque)
>> +{
>> +    BlockIORequest *request;
>> +    BlockDriverAIOCB *acb;
>> +
>> +    request = qemu_malloc(sizeof(BlockIORequest));
>> +    request->bs = bs;
>> +    request->handler = handler;
>> +    request->sector_num = sector_num;
>> +    request->qiov = qiov;
>> +    request->nb_sectors = nb_sectors;
>> +    request->cb = cb;
>> +    request->opaque = opaque;
>> +
>> +    QTAILQ_INSERT_TAIL(&queue->requests, request, entry);
>
> It would be simpler to define BlockQueueAIOCB and using it as our acb
> instead of managing an extra BlockIORequest structure.  That way you
> don't need to worry about extra mallocs and frees.
>
>> +
>> +    acb = qemu_aio_get(&block_queue_pool, bs,
>> +                       qemu_block_queue_callback, opaque);
>> +
>> +    request->acb = acb;
>> +
>> +    return acb;
>> +}
>> +
>> +int qemu_block_queue_handler(BlockIORequest *request)
>
> This function can be static.
>
> Stefan
>



-- 
Regards,

Zhi Yong Wu

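Stefan's BlockQueueAIOCB suggestion in the message above amounts to folding the request parameters into the acb itself, so one allocation serves as both the queue entry and the handle returned to the caller. A rough sketch of the layout idea follows; the field and type names are stand-ins, not the real QEMU definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for the QEMU acb; only the embedding idea matters here. */
typedef struct BlockDriverAIOCB {
    void *bs;
    void (*cb)(void *opaque, int ret);
    void *opaque;
} BlockDriverAIOCB;

/* One allocation per queued request: the acb handed back to the caller and
 * the parameters needed to (re)dispatch it later live in the same object,
 * so no separate BlockIORequest malloc/free is needed and the same acb can
 * be dispatched again without creating a new one. */
typedef struct BlockQueueAIOCB {
    BlockDriverAIOCB common;     /* must be first: the AIO pool hands this out */
    int64_t sector_num;
    void *qiov;
    int nb_sectors;
    BlockDriverAIOCB *real_acb;  /* set once the request has been dispatched */
} BlockQueueAIOCB;
```

With `common` at offset 0, the pool can hand out `&qacb->common` and the cancel/dispatch paths can cast the caller's pointer back to the full structure.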

* Re: [Qemu-devel] [PATCH v5 2/4] block: add the block queue support
  2011-08-12  8:10     ` Zhi Yong Wu
@ 2011-08-12  8:42       ` Stefan Hajnoczi
  2011-08-12  9:11         ` Zhi Yong Wu
  0 siblings, 1 reply; 36+ messages in thread
From: Stefan Hajnoczi @ 2011-08-12  8:42 UTC (permalink / raw)
  To: Zhi Yong Wu
  Cc: kwolf, stefanha, kvm, mtosatti, qemu-devel, ryanh, Zhi Yong Wu

On Fri, Aug 12, 2011 at 9:10 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
> On Tue, Aug 9, 2011 at 8:49 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
>>> +/* The APIs for block request queue on qemu block layer.
>>> + */
>>> +
>>> +static void qemu_block_queue_cancel(BlockDriverAIOCB *acb)
>>> +{
>>> +    qemu_aio_release(acb);
>>> +}
>>> +
>>> +static AIOPool block_queue_pool = {
>>> +    .aiocb_size         = sizeof(struct BlockDriverAIOCB),
>>> +    .cancel             = qemu_block_queue_cancel,
>>> +};
>>
>> The lifecycle of block_queue_pool acbs should be like this:
>>
>> When a request is queued we need a BlockQueue acb because we have no
>> real acb yet.  So we get an acb from block_queue_pool.
>>
>> If the acb is cancelled before being dispatched we need to release the
>> acb and remove the request from the queue.  (This patch does not
>> remove the request from the queue on cancel.)
>>
>> When the acb is dispatched we need to keep track of the real acb (e.g.
>> from qcow2).  The caller will only ever see our acb because there is
>> no way to give them the pointer to the new acb after dispatch.  That
>> means we need to keep our acb alive for the entire lifetime of the
>> request.  (This patch currently releases our acb when the request is
>> dispatched.)
>>
>> If the acb is cancelled after being dispatched we need to first cancel
>> the real acb and then release our acb.
>>
>> When the acb is dispatched we need to pass qemu_block_queue_callback
>> as the cb function to the handler.  Inside qemu_block_queue_callback we
> When one block request has been re-enqueued multiple times, it will be
> difficult to store the original cb function.

The challenge is that blkqueue creates an acb and returns it to the
user when a request is enqueued.  Because the user now has the acb we
must keep this acb around for the lifetime of the request.

The current code will create a new acb if the request is enqueued
again.  We should avoid this.

The code needs to be structured so that a queued acb can be dispatched
directly - without creating a new BlockQueueAIOCB.

Stefan

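The acb lifecycle discussed in this subthread (queued, then dispatched, with cancel legal in both states but handled differently) can be sketched as a small state machine; the names here are illustrative only:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    ACB_QUEUED,      /* waiting in the block queue, no real acb yet */
    ACB_DISPATCHED,  /* handler called, real (e.g. qcow2) acb in flight */
    ACB_DONE         /* callback ran, acb may be released */
} AcbState;

/* Cancel behaviour depends on where the request is in its lifetime: before
 * dispatch we only unlink it from the queue; after dispatch we must first
 * cancel the real acb and only then release ours. */
static bool cancel(AcbState *s, bool *must_cancel_real)
{
    switch (*s) {
    case ACB_QUEUED:
        *must_cancel_real = false;   /* just remove from the queue */
        *s = ACB_DONE;
        return true;
    case ACB_DISPATCHED:
        *must_cancel_real = true;    /* forward the cancel to the real acb */
        *s = ACB_DONE;
        return true;
    case ACB_DONE:
        return false;                /* nothing left to cancel */
    }
    return false;
}
```

This matches the review points above: the v5 patch neither removes queued requests on cancel nor keeps the acb alive through dispatch, which the two non-DONE branches make explicit.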

* Re: [Qemu-devel] [PATCH v5 2/4] block: add the block queue support
  2011-08-12  8:42       ` Stefan Hajnoczi
@ 2011-08-12  9:11         ` Zhi Yong Wu
  0 siblings, 0 replies; 36+ messages in thread
From: Zhi Yong Wu @ 2011-08-12  9:11 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kwolf, stefanha, kvm, mtosatti, qemu-devel, ryanh, Zhi Yong Wu

On Fri, Aug 12, 2011 at 4:42 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Fri, Aug 12, 2011 at 9:10 AM, Zhi Yong Wu <zwu.kernel@gmail.com> wrote:
>> On Tue, Aug 9, 2011 at 8:49 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>>> On Tue, Aug 9, 2011 at 5:17 AM, Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> wrote:
>>>> +/* The APIs for block request queue on qemu block layer.
>>>> + */
>>>> +
>>>> +static void qemu_block_queue_cancel(BlockDriverAIOCB *acb)
>>>> +{
>>>> +    qemu_aio_release(acb);
>>>> +}
>>>> +
>>>> +static AIOPool block_queue_pool = {
>>>> +    .aiocb_size         = sizeof(struct BlockDriverAIOCB),
>>>> +    .cancel             = qemu_block_queue_cancel,
>>>> +};
>>>
>>> The lifecycle of block_queue_pool acbs should be like this:
>>>
>>> When a request is queued we need a BlockQueue acb because we have no
>>> real acb yet.  So we get an acb from block_queue_pool.
>>>
>>> If the acb is cancelled before being dispatched we need to release the
>>> acb and remove the request from the queue.  (This patch does not
>>> remove the request from the queue on cancel.)
>>>
>>> When the acb is dispatched we need to keep track of the real acb (e.g.
>>> from qcow2).  The caller will only ever see our acb because there is
>>> no way to give them the pointer to the new acb after dispatch.  That
>>> means we need to keep our acb alive for the entire lifetime of the
>>> request.  (This patch currently releases our acb when the request is
>>> dispatched.)
>>>
>>> If the acb is cancelled after being dispatched we need to first cancel
>>> the real acb and then release our acb.
>>>
>>> When the acb is dispatched we need to pass qemu_block_queue_callback
>>> as the cb function to the handler.  Inside qemu_block_queue_callback we
>> When one block request has been re-enqueued multiple times, it will be
>> difficult to store the original cb function.
>
> The challenge is that blkqueue creates an acb and returns it to the
> user when a request is enqueued.  Because the user now has the acb we
> must keep this acb around for the lifetime of the request.
>
> The current code will create a new acb if the request is enqueued
> again.  We should avoid this.
>
> The code needs to be structured so that a queued acb can be dispatched
> directly - without creating a new BlockQueueAIOCB.
Maybe we can work around this issue by temporarily stashing our acb in a
field of the block queue.
Please see my public git tree. :)

>
> Stefan
>



-- 
Regards,

Zhi Yong Wu

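Stefan's point that a queued acb should be dispatched directly, without allocating a replacement, can be illustrated with a minimal flush loop over intrusive queue entries; the types below are stand-ins for illustration, not the real QEMU API:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a queued request: the acb the caller holds doubles as the
 * queue entry, so dispatch reuses it instead of creating a new object. */
typedef struct QueuedAcb {
    struct QueuedAcb *next;
    int dispatched;          /* how many times this same acb was dispatched */
} QueuedAcb;

typedef struct {
    QueuedAcb *head;
} BlockQueue;

/* Dispatch every queued request in FIFO order.  The same QueuedAcb object
 * survives the dispatch (and could be re-enqueued if throttling kicks in
 * again) rather than being released and re-created, which is what would
 * lose the original cb. */
static int queue_flush(BlockQueue *q)
{
    int n = 0;
    while (q->head) {
        QueuedAcb *acb = q->head;
        q->head = acb->next;
        acb->next = NULL;
        acb->dispatched++;   /* stands in for calling the stored handler */
        n++;
    }
    return n;
}
```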

end of thread, other threads:[~2011-08-12  9:11 UTC | newest]

Thread overview: 36+ messages
2011-08-09  4:17 [Qemu-devel] [PATCH v5 0/4] The intro of QEMU block I/O throttling Zhi Yong Wu
2011-08-09  4:17 ` [Qemu-devel] [PATCH v5 1/4] block: add the command line support Zhi Yong Wu
2011-08-09 12:25   ` Stefan Hajnoczi
2011-08-10  5:20     ` Zhi Yong Wu
2011-08-10  9:27       ` Stefan Hajnoczi
2011-08-11  4:44         ` Zhi Yong Wu
2011-08-11  5:35           ` Stefan Hajnoczi
2011-08-09  4:17 ` [Qemu-devel] [PATCH v5 2/4] block: add the block queue support Zhi Yong Wu
2011-08-09  8:46   ` Ram Pai
2011-08-09  8:53     ` Zhi Yong Wu
2011-08-09 12:49   ` Stefan Hajnoczi
2011-08-10  5:54     ` Zhi Yong Wu
2011-08-10  9:37       ` Stefan Hajnoczi
2011-08-11  4:59         ` Zhi Yong Wu
2011-08-11  5:36         ` Zhi Yong Wu
2011-08-12  4:40         ` Zhi Yong Wu
2011-08-12  4:50           ` Stefan Hajnoczi
2011-08-12  8:10     ` Zhi Yong Wu
2011-08-12  8:42       ` Stefan Hajnoczi
2011-08-12  9:11         ` Zhi Yong Wu
2011-08-09  4:17 ` [Qemu-devel] [PATCH v5 3/4] block: add block timer and block throttling algorithm Zhi Yong Wu
2011-08-09  8:57   ` Ram Pai
2011-08-09  9:06     ` Zhi Yong Wu
2011-08-12  5:35     ` Zhi Yong Wu
2011-08-12  5:47       ` Stefan Hajnoczi
2011-08-12  6:00         ` Zhi Yong Wu
2011-08-09 15:19   ` Stefan Hajnoczi
2011-08-10  6:57     ` Zhi Yong Wu
2011-08-10 11:00       ` Stefan Hajnoczi
2011-08-12  5:00         ` Zhi Yong Wu
2011-08-12  5:06           ` Stefan Hajnoczi
2011-08-12  5:23             ` Zhi Yong Wu
2011-08-09  4:17 ` [Qemu-devel] [PATCH v5 4/4] qmp/hmp: add block_set_io_throttle Zhi Yong Wu
2011-08-09 12:08 ` [Qemu-devel] [PATCH v5 0/4] The intro of QEMU block I/O throttling Stefan Hajnoczi
2011-08-10  5:09   ` Zhi Yong Wu
2011-08-10  9:39     ` Stefan Hajnoczi
