qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [RFC V3 0/2] continuous leaky bucket throttling
@ 2013-08-02 15:53 Benoît Canet
  2013-08-02 15:53 ` [Qemu-devel] [RFC V3 1/2] throttle: Add a new throttling API implementing continuous leaky bucket Benoît Canet
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Benoît Canet @ 2013-08-02 15:53 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, Benoît Canet, stefanha

This patch set implements continuous leaky bucket throttling.

It works well in the general case.
The exception is when the load is composed of both reads and writes and the
two limits iops_rd and iops_wr are both set.
The resulting iops are then a little above half of the given limits.
I tried various strategies to avoid this: two timers, two throttled request
queues, and even a different algorithm using a priority queue.
The problem is the same in every version of the code: read and write
operations seem entangled.

Benoît Canet (2):
  throttle: Add a new throttling API implementing continuous leaky
    bucket.
  block: Enable the new throttling code in the block layer.

 block.c                   |  316 ++++++++------------------------
 block/qapi.c              |   21 +--
 blockdev.c                |  115 ++++++------
 include/block/block.h     |    1 -
 include/block/block_int.h |   33 +---
 include/qemu/throttle.h   |  111 ++++++++++++
 util/Makefile.objs        |    1 +
 util/throttle.c           |  436 +++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 698 insertions(+), 336 deletions(-)
 create mode 100644 include/qemu/throttle.h
 create mode 100644 util/throttle.c

-- 
1.7.10.4

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Qemu-devel] [RFC V3 1/2] throttle: Add a new throttling API implementing continuous leaky bucket.
  2013-08-02 15:53 [Qemu-devel] [RFC V3 0/2] continuous leaky bucket throttling Benoît Canet
@ 2013-08-02 15:53 ` Benoît Canet
  2013-08-02 15:53 ` [Qemu-devel] [RFC V3 2/2] block: Enable the new throttling code in the block layer Benoît Canet
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Benoît Canet @ 2013-08-02 15:53 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, Benoît Canet, stefanha

Implement the continuous leaky bucket algorithm devised on IRC as a separate
module.

Signed-off-by: Benoit Canet <benoit@irqsave.net>
---
 include/qemu/throttle.h |  111 ++++++++++++
 util/Makefile.objs      |    1 +
 util/throttle.c         |  436 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 548 insertions(+)
 create mode 100644 include/qemu/throttle.h
 create mode 100644 util/throttle.c

diff --git a/include/qemu/throttle.h b/include/qemu/throttle.h
new file mode 100644
index 0000000..328c782
--- /dev/null
+++ b/include/qemu/throttle.h
@@ -0,0 +1,111 @@
+/*
+ * QEMU throttling infrastructure
+ *
+ * Copyright (C) Nodalink, SARL. 2013
+ *
+ * Author:
+ *   Benoît Canet <benoit.canet@irqsave.net>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 or
+ * (at your option) version 3 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef THROTTLING_H
+#define THROTTLING_H
+
+#include <stdint.h>
+#include "qemu-common.h"
+#include "qemu/timer.h"
+
+#define NANOSECONDS_PER_SECOND  1000000000.0
+
+#define BUCKETS_COUNT 6
+
+typedef enum {
+    THROTTLE_BPS_TOTAL = 0,
+    THROTTLE_BPS_READ  = 1,
+    THROTTLE_BPS_WRITE = 2,
+    THROTTLE_OPS_TOTAL = 3,
+    THROTTLE_OPS_READ  = 4,
+    THROTTLE_OPS_WRITE = 5,
+} BucketType;
+
+typedef struct LeakyBucket {
+    int64_t ups;           /* units per second */
+    int64_t max;           /* leaky bucket max in units */
+    double  bucket;        /* bucket in units */
+    int64_t previous_leak; /* timestamp of the last leak done */
+} LeakyBucket;
+
+/* The following structure is used to configure a ThrottleState.
+ * It contains a bit of state: the bucket field of the LeakyBucket structure.
+ * Keeping it here keeps the code clean, and the bucket field is reset to
+ * zero at the right time.
+ */
+typedef struct ThrottleConfig {
+    LeakyBucket buckets[BUCKETS_COUNT]; /* leaky buckets */
+    int64_t unit_size;         /* size of a unit in bytes */
+    int64_t op_size;           /* size of an operation in units */
+} ThrottleConfig;
+
+typedef struct ThrottleState {
+    ThrottleConfig cfg;
+    bool timer_is_throttling_write; /* is the timer throttling a write */
+    QEMUTimer *timer;            /* timer used to do the throttling */
+    QEMUClock *clock;            /* the clock used */
+} ThrottleState;
+
+/* the following 3 functions are exposed for tests */
+bool throttle_do_start(ThrottleState *ts,
+                       bool is_write,
+                       int64_t size,
+                       int64_t now,
+                       int64_t *next_timer);
+
+bool throttle_do_end(ThrottleState *ts,
+                     bool is_write,
+                     int64_t now,
+                     int64_t *next_timer);
+
+bool throttle_do_timer(ThrottleState *ts,
+                       bool is_write,
+                       int64_t now,
+                       int64_t *next_timer);
+
+/* user API functions */
+void throttle_init(ThrottleState *ts,
+                   QEMUClock *clock,
+                   void (*timer)(void *),
+                   void *timer_opaque);
+
+void throttle_destroy(ThrottleState *ts);
+
+bool throttle_start(ThrottleState *ts, bool is_write, int64_t size);
+
+void throttle_end(ThrottleState *ts, bool is_write);
+
+void throttle_timer(ThrottleState *ts, int64_t now, bool *must_wait);
+
+void throttle_config(ThrottleState *ts, ThrottleConfig *cfg);
+
+void throttle_get_config(ThrottleState *ts, ThrottleConfig *cfg);
+
+bool throttle_enabled(ThrottleConfig *cfg);
+
+bool throttle_conflicting(ThrottleConfig *cfg);
+
+bool throttle_is_valid(ThrottleConfig *cfg);
+
+bool throttle_have_timer(ThrottleState *ts);
+
+#endif
diff --git a/util/Makefile.objs b/util/Makefile.objs
index dc72ab0..2bb13a2 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -11,3 +11,4 @@ util-obj-y += iov.o aes.o qemu-config.o qemu-sockets.o uri.o notify.o
 util-obj-y += qemu-option.o qemu-progress.o
 util-obj-y += hexdump.o
 util-obj-y += crc32c.o
+util-obj-y += throttle.o
diff --git a/util/throttle.c b/util/throttle.c
new file mode 100644
index 0000000..4afc407
--- /dev/null
+++ b/util/throttle.c
@@ -0,0 +1,436 @@
+/*
+ * QEMU throttling infrastructure
+ *
+ * Copyright (C) Nodalink, SARL. 2013
+ *
+ * Author:
+ *   Benoît Canet <benoit.canet@irqsave.net>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 or
+ * (at your option) version 3 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/throttle.h"
+#include "qemu/timer.h"
+
+/* This function makes a bucket leak
+ *
+ * @bkt:   the bucket to leak
+ * @now:   the current timestamp in ns
+ */
+static void throttle_leak_bucket(LeakyBucket *bkt, int64_t now)
+{
+    /* compute the time elapsed since the last leak */
+    int64_t delta = now - bkt->previous_leak;
+    double leak;
+
+    bkt->previous_leak = now;
+
+    if (delta <= 0) {
+        return;
+    }
+
+    /* compute how much to leak */
+    leak = (double) bkt->ups * delta / NANOSECONDS_PER_SECOND;
+
+    /* make the bucket leak */
+    bkt->bucket = MAX(bkt->bucket - leak, 0);
+}
+
+/* Calculate the time delta since the last leak and make proportional leaks
+ *
+ * @now:      the current timestamp in ns
+ */
+static void throttle_do_leak(ThrottleState *ts, int64_t now)
+{
+    int i;
+
+    /* make each bucket leak */
+    for (i = 0; i < BUCKETS_COUNT; i++) {
+        throttle_leak_bucket(&ts->cfg.buckets[i], now);
+    }
+}
+
+/* do the real job of computing the time to wait */
+static int64_t throttle_do_compute_wait(int64_t limit, double extra)
+{
+    int64_t wait = extra * NANOSECONDS_PER_SECOND;
+    wait /= limit;
+    return wait;
+}
+
+/* This function computes the wait time in ns that a leaky bucket should trigger
+ *
+ * @bkt:      the leaky bucket we operate on
+ * @ret:      the resulting wait time in ns or 0 if the operation can go through
+ */
+static int64_t throttle_compute_wait(LeakyBucket *bkt)
+{
+    int64_t extra; /* the number of extra units blocking the io */
+
+    if (!bkt->ups) {
+        return 0;
+    }
+
+    extra = bkt->bucket - bkt->max;
+
+    return throttle_do_compute_wait(bkt->ups, extra);
+}
+
+/* This function computes the time this IO must wait before being authorized
+ *
+ * @is_write:   true if the current IO is a write, false if it's a read
+ * @ret:        time to wait in ns, or 0 if no wait is needed
+ */
+static int64_t throttle_compute_max_wait(ThrottleState *ts,
+                                         bool is_write)
+{
+    BucketType to_check[2][4] = { {THROTTLE_BPS_TOTAL,
+                                   THROTTLE_OPS_TOTAL,
+                                   THROTTLE_BPS_READ,
+                                   THROTTLE_OPS_READ},
+                                  {THROTTLE_BPS_TOTAL,
+                                   THROTTLE_OPS_TOTAL,
+                                   THROTTLE_BPS_WRITE,
+                                   THROTTLE_OPS_WRITE}, };
+    int64_t wait, max_wait = 0;
+    int i;
+
+    for (i = 0; i < 4; i++) {
+        BucketType index = to_check[is_write][i];
+        wait = throttle_compute_wait(&ts->cfg.buckets[index]);
+        if (wait > max_wait) {
+            max_wait = wait;
+        }
+    }
+
+    return max_wait;
+}
+
+static bool throttle_leak_and_compute_timer(ThrottleState *ts,
+                                            bool is_write,
+                                            int64_t now,
+                                            int64_t *next_timer)
+{
+    int64_t wait;
+
+    /* leak proportionally to the time elapsed */
+    throttle_do_leak(ts, now);
+
+    /* compute the wait time if any */
+    wait = throttle_compute_max_wait(ts, is_write);
+
+    /* if the code must wait compute when the next timer should fire */
+    if (wait) {
+        *next_timer = now + wait;
+        return true;
+    }
+
+    /* else no need to wait at all */
+    *next_timer = now;
+    return false;
+}
+
+bool throttle_do_start(ThrottleState *ts,
+                       bool is_write,
+                       int64_t size,
+                       int64_t now,
+                       int64_t *next_timer)
+{
+    double bytes_size;
+    double units = 1.0;
+    bool must_wait = throttle_leak_and_compute_timer(ts,
+                                                     is_write,
+                                                     now,
+                                                     next_timer);
+
+    if (must_wait) {
+        return true;
+    }
+
+    if (ts->cfg.op_size) {
+        units = (double) size / ts->cfg.op_size;
+    }
+
+    /* NOTE: the counter can go above the max when authorizing an IO.
+     *       At the next call the code will punish the guest by blocking the
+     *       next IO until the counter has been decremented below the max.
+     *       This way, if a guest issues a jumbo IO bigger than the max, it
+     *       will still have a chance to be authorized and will not result
+     *       in a guest IO deadlock.
+     */
+
+    /* the IO is authorized so do the accounting and return false */
+    bytes_size = size * ts->cfg.unit_size;
+    ts->cfg.buckets[THROTTLE_BPS_TOTAL].bucket += bytes_size;
+    ts->cfg.buckets[THROTTLE_OPS_TOTAL].bucket += units;
+
+    if (is_write) {
+        ts->cfg.buckets[THROTTLE_BPS_WRITE].bucket += bytes_size;
+        ts->cfg.buckets[THROTTLE_OPS_WRITE].bucket += units;
+    } else {
+        ts->cfg.buckets[THROTTLE_BPS_READ].bucket += bytes_size;
+        ts->cfg.buckets[THROTTLE_OPS_READ].bucket += units;
+    }
+
+    /* no wait */
+    return false;
+}
+
+bool throttle_do_end(ThrottleState *ts,
+                     bool is_write,
+                     int64_t now,
+                     int64_t *next_timer)
+{
+    return throttle_leak_and_compute_timer(ts, is_write, now, next_timer);
+}
+
+bool throttle_do_timer(ThrottleState *ts,
+                       bool is_write,
+                       int64_t now,
+                       int64_t *next_timer)
+{
+    return throttle_leak_and_compute_timer(ts, is_write, now, next_timer);
+}
+
+/* To be called first on the ThrottleState */
+void throttle_init(ThrottleState *ts,
+                   QEMUClock *clock,
+                   void (*timer)(void *),
+                   void *timer_opaque)
+{
+    memset(ts, 0, sizeof(ThrottleState));
+
+    ts->clock = clock;
+    ts->timer = qemu_new_timer_ns(ts->clock, timer, timer_opaque);
+}
+
+/* To be called last on the ThrottleState */
+void throttle_destroy(ThrottleState *ts)
+{
+    assert(ts->timer != NULL);
+
+    qemu_del_timer(ts->timer);
+    qemu_free_timer(ts->timer);
+    ts->timer = NULL;
+}
+
+/* Used to configure the throttle
+ *
+ * @ts: the throttle state we are working on
+ * @cfg: the config to set
+ */
+void throttle_config(ThrottleState *ts, ThrottleConfig *cfg)
+{
+    int i;
+
+    ts->cfg = *cfg;
+
+    /* zero the buckets */
+    for (i = 0; i < BUCKETS_COUNT; i++) {
+        ts->cfg.buckets[i].bucket = 0;
+    }
+
+    /* init previous leaks fields */
+    for (i = 0; i < BUCKETS_COUNT; i++) {
+        ts->cfg.buckets[i].previous_leak = qemu_get_clock_ns(ts->clock);
+    }
+
+    assert(ts->timer != NULL);
+    if (!qemu_timer_pending(ts->timer)) {
+        return;
+    }
+
+    /* cancel current running timer */
+    qemu_del_timer(ts->timer);
+}
+
+/* Used to get the current configuration
+ *
+ * @ts: the throttle state we are working on
+ * @cfg: where to write the config
+ */
+void throttle_get_config(ThrottleState *ts, ThrottleConfig *cfg)
+{
+    *cfg = ts->cfg;
+}
+
+/* avoid any stutter caused by read and write throttling interleaving */
+static int64_t throttle_avoid_stutter(ThrottleState *ts,
+                                      bool is_write,
+                                      int64_t next_timer)
+{
+    int64_t current_timer;
+
+    /* last throttled operation was of the same type -> do nothing */
+    if (ts->timer_is_throttling_write == is_write) {
+        return next_timer;
+    }
+
+    /* no timer is pending -> do nothing */
+    if (!qemu_timer_pending(ts->timer)) {
+        return next_timer;
+    }
+
+    /* get back the current running timer expiration time */
+    current_timer = qemu_timer_expire_time_ns(ts->timer);
+
+    /* if the pending timer fires sooner, keep it */
+    if (current_timer < next_timer) {
+        return current_timer;
+    }
+
+    /* remember the type of operation the timer is throttling */
+    ts->timer_is_throttling_write = is_write;
+
+    return next_timer;
+}
+
+bool throttle_start(ThrottleState *ts, bool is_write, int64_t size)
+{
+    int64_t now = qemu_get_clock_ns(ts->clock);
+    int64_t next_timer;
+    bool must_wait;
+
+    must_wait = throttle_do_start(ts, is_write, size, now, &next_timer);
+
+    if (!must_wait) {
+        return false;
+    }
+
+    next_timer = throttle_avoid_stutter(ts, is_write, next_timer);
+    qemu_mod_timer(ts->timer, next_timer);
+    return true;
+}
+
+void throttle_end(ThrottleState *ts, bool is_write)
+{
+    int64_t now = qemu_get_clock_ns(ts->clock);
+    int64_t next_timer;
+    bool must_wait;
+
+    must_wait = throttle_do_end(ts, is_write, now, &next_timer);
+
+    if (!must_wait) {
+        return;
+    }
+
+    next_timer = throttle_avoid_stutter(ts, is_write, next_timer);
+    qemu_mod_timer(ts->timer, next_timer);
+}
+
+static void throttle_swap_timers(int64_t *next_timers)
+{
+    int64_t tmp = next_timers[0];
+    next_timers[0] = next_timers[1];
+    next_timers[1] = tmp;
+}
+
+static void throttle_sort_timers(int64_t *next_timers)
+{
+    if (next_timers[0] < next_timers[1]) {
+        return;
+    }
+
+    throttle_swap_timers(next_timers);
+}
+
+void throttle_timer(ThrottleState *ts, int64_t now, bool *must_wait)
+{
+    int64_t next_timers[2];
+    int i;
+
+    /* for both reads and writes: must the queued IOs wait, and until when */
+    for (i = 0; i < 2; i++) {
+        must_wait[i] = throttle_do_timer(ts,
+                                         i,
+                                         now,
+                                         &next_timers[i]);
+    }
+
+    throttle_sort_timers(next_timers);
+
+    /* if both reads and writes must be throttled take the earliest timer */
+    if (must_wait[0] && must_wait[1]) {
+        qemu_mod_timer(ts->timer, next_timers[0]);
+    /* if only one type of IO must be throttled take the latest timer */
+    } else if (must_wait[0] || must_wait[1]) {
+        qemu_mod_timer(ts->timer, next_timers[1]);
+    }
+}
+
+bool throttle_enabled(ThrottleConfig *cfg)
+{
+    int i;
+
+    for (i = 0; i < BUCKETS_COUNT; i++) {
+        if (cfg->buckets[i].ups) {
+            return true;
+        }
+    }
+
+    return false;
+}
+
+bool throttle_conflicting(ThrottleConfig *cfg)
+{
+    bool bps_flag, ops_flag;
+    bool bps_max_flag, ops_max_flag;
+
+    bps_flag = cfg->buckets[THROTTLE_BPS_TOTAL].ups &&
+               (cfg->buckets[THROTTLE_BPS_READ].ups ||
+                cfg->buckets[THROTTLE_BPS_WRITE].ups);
+
+    ops_flag = cfg->buckets[THROTTLE_OPS_TOTAL].ups &&
+               (cfg->buckets[THROTTLE_OPS_READ].ups ||
+                cfg->buckets[THROTTLE_OPS_WRITE].ups);
+
+    bps_max_flag = cfg->buckets[THROTTLE_BPS_TOTAL].max &&
+                  (cfg->buckets[THROTTLE_BPS_READ].max  ||
+                   cfg->buckets[THROTTLE_BPS_WRITE].max);
+
+    ops_max_flag = cfg->buckets[THROTTLE_OPS_TOTAL].max &&
+                   (cfg->buckets[THROTTLE_OPS_READ].max ||
+                   cfg->buckets[THROTTLE_OPS_WRITE].max);
+
+    return bps_flag || ops_flag || bps_max_flag || ops_max_flag;
+}
+
+bool throttle_is_valid(ThrottleConfig *cfg)
+{
+    bool invalid = false;
+    int i;
+
+    for (i = 0; i < BUCKETS_COUNT; i++) {
+        if (cfg->buckets[i].ups < 0) {
+            invalid = true;
+        }
+    }
+
+    for (i = 0; i < BUCKETS_COUNT; i++) {
+        if (cfg->buckets[i].max < 0) {
+            invalid = true;
+        }
+    }
+
+    return !invalid;
+}
+
+bool throttle_have_timer(ThrottleState *ts)
+{
+    if (ts->timer) {
+        return true;
+    }
+
+    return false;
+}
-- 
1.7.10.4


* [Qemu-devel] [RFC V3 2/2] block: Enable the new throttling code in the block layer.
  2013-08-02 15:53 [Qemu-devel] [RFC V3 0/2] continuous leaky bucket throttling Benoît Canet
  2013-08-02 15:53 ` [Qemu-devel] [RFC V3 1/2] throttle: Add a new throttling API implementing continuous leaky bucket Benoît Canet
@ 2013-08-02 15:53 ` Benoît Canet
  2013-08-06  9:22 ` [Qemu-devel] [RFC V3 0/2] continuous leaky bucket throttling Fam Zheng
  2013-08-07  8:31 ` Stefan Hajnoczi
  3 siblings, 0 replies; 6+ messages in thread
From: Benoît Canet @ 2013-08-02 15:53 UTC (permalink / raw)
  To: qemu-devel; +Cc: kwolf, pbonzini, Benoît Canet, stefanha

Signed-off-by: Benoit Canet <benoit@irqsave.net>
---
 block.c                   |  316 +++++++++++----------------------------------
 block/qapi.c              |   21 ++-
 blockdev.c                |  115 +++++++++--------
 include/block/block.h     |    1 -
 include/block/block_int.h |   33 +----
 5 files changed, 150 insertions(+), 336 deletions(-)

diff --git a/block.c b/block.c
index 5a2240f..efa64fa 100644
--- a/block.c
+++ b/block.c
@@ -86,13 +86,6 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque);
 static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
     int64_t sector_num, int nb_sectors);
 
-static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
-        bool is_write, double elapsed_time, uint64_t *wait);
-static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
-        double elapsed_time, uint64_t *wait);
-static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
-        bool is_write, int64_t *wait);
-
 static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
     QTAILQ_HEAD_INITIALIZER(bdrv_states);
 
@@ -123,55 +116,68 @@ int is_windows_drive(const char *filename)
 #endif
 
 /* throttling disk I/O limits */
-void bdrv_io_limits_disable(BlockDriverState *bs)
+void bdrv_set_io_limits(BlockDriverState *bs,
+                        ThrottleConfig *cfg)
 {
-    bs->io_limits_enabled = false;
+    throttle_config(&bs->throttle_state, cfg);
+}
 
-    while (qemu_co_enter_next(&bs->throttled_reqs)) {
-    }
+static bool bdrv_drain_throttled(BlockDriverState *bs)
+{
+    bool drained = false;
+    int i;
 
-    if (bs->block_timer) {
-        qemu_del_timer(bs->block_timer);
-        qemu_free_timer(bs->block_timer);
-        bs->block_timer = NULL;
+    for (i = 0; i < 2; i++) {
+        while (qemu_co_enter_next(&bs->throttled_reqs[i])) {
+            drained = true;
+        }
     }
 
-    bs->slice_start = 0;
-    bs->slice_end   = 0;
+    return drained;
 }
 
-static void bdrv_block_timer(void *opaque)
+void bdrv_io_limits_disable(BlockDriverState *bs)
 {
-    BlockDriverState *bs = opaque;
+    bs->io_limits_enabled = false;
+
+    bdrv_drain_throttled(bs);
 
-    qemu_co_enter_next(&bs->throttled_reqs);
+    throttle_destroy(&bs->throttle_state);
 }
 
-void bdrv_io_limits_enable(BlockDriverState *bs)
+static void bdrv_throttle_timer_cb(void *opaque)
 {
-    qemu_co_queue_init(&bs->throttled_reqs);
-    bs->block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
-    bs->io_limits_enabled = true;
+    BlockDriverState *bs = opaque;
+    int64_t now = qemu_get_clock_ns(vm_clock);
+    bool must_wait[2]; /* do throttled reads or writes have to wait? */
+    int i;
+
+    throttle_timer(&bs->throttle_state, now, must_wait);
+
+    /* restart queued requests that are no longer throttled */
+    for (i = 0; i < 2; i++)  {
+        if (must_wait[i]) {
+            continue;
+        }
+        qemu_co_enter_next(&bs->throttled_reqs[i]);
+    }
 }
 
-bool bdrv_io_limits_enabled(BlockDriverState *bs)
+/* should be called before bdrv_set_io_limits if a limit is set */
+void bdrv_io_limits_enable(BlockDriverState *bs)
 {
-    BlockIOLimit *io_limits = &bs->io_limits;
-    return io_limits->bps[BLOCK_IO_LIMIT_READ]
-         || io_limits->bps[BLOCK_IO_LIMIT_WRITE]
-         || io_limits->bps[BLOCK_IO_LIMIT_TOTAL]
-         || io_limits->iops[BLOCK_IO_LIMIT_READ]
-         || io_limits->iops[BLOCK_IO_LIMIT_WRITE]
-         || io_limits->iops[BLOCK_IO_LIMIT_TOTAL];
+    throttle_init(&bs->throttle_state, vm_clock, bdrv_throttle_timer_cb, bs);
+    qemu_co_queue_init(&bs->throttled_reqs[0]);
+    qemu_co_queue_init(&bs->throttled_reqs[1]);
+    bs->io_limits_enabled = true;
 }
 
 static void bdrv_io_limits_intercept(BlockDriverState *bs,
-                                     bool is_write, int nb_sectors)
+                                     bool is_write,
+                                     int nb_sectors)
 {
-    int64_t wait_time = -1;
-
-    if (!qemu_co_queue_empty(&bs->throttled_reqs)) {
-        qemu_co_queue_wait(&bs->throttled_reqs);
+    if (!qemu_co_queue_empty(&bs->throttled_reqs[is_write])) {
+        qemu_co_queue_wait(&bs->throttled_reqs[is_write]);
     }
 
     /* In fact, we hope to keep each request's timing, in FIFO mode. The next
@@ -181,13 +187,11 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs,
      * be still in throttled_reqs queue.
      */
 
-    while (bdrv_exceed_io_limits(bs, nb_sectors, is_write, &wait_time)) {
-        qemu_mod_timer(bs->block_timer,
-                       wait_time + qemu_get_clock_ns(vm_clock));
-        qemu_co_queue_wait_insert_head(&bs->throttled_reqs);
+    while (throttle_start(&bs->throttle_state, is_write, nb_sectors)) {
+        qemu_co_queue_wait_insert_head(&bs->throttled_reqs[is_write]);
     }
 
-    qemu_co_queue_next(&bs->throttled_reqs);
+    qemu_co_queue_next(&bs->throttled_reqs[is_write]);
 }
 
 /* check if the path starts with "<protocol>:" */
@@ -1106,11 +1110,6 @@ int bdrv_open(BlockDriverState *bs, const char *filename, QDict *options,
         bdrv_dev_change_media_cb(bs, true);
     }
 
-    /* throttling disk I/O limits */
-    if (bs->io_limits_enabled) {
-        bdrv_io_limits_enable(bs);
-    }
-
     return 0;
 
 unlink_and_fail:
@@ -1446,16 +1445,15 @@ void bdrv_drain_all(void)
          * a busy wait.
          */
         QTAILQ_FOREACH(bs, &bdrv_states, list) {
-            while (qemu_co_enter_next(&bs->throttled_reqs)) {
-                busy = true;
-            }
+            busy |= bdrv_drain_throttled(bs);
         }
     } while (busy);
 
     /* If requests are still pending there is a bug somewhere */
     QTAILQ_FOREACH(bs, &bdrv_states, list) {
         assert(QLIST_EMPTY(&bs->tracked_requests));
-        assert(qemu_co_queue_empty(&bs->throttled_reqs));
+        assert(qemu_co_queue_empty(&bs->throttled_reqs[0]));
+        assert(qemu_co_queue_empty(&bs->throttled_reqs[1]));
     }
 }
 
@@ -1491,13 +1489,12 @@ static void bdrv_move_feature_fields(BlockDriverState *bs_dest,
 
     bs_dest->enable_write_cache = bs_src->enable_write_cache;
 
-    /* i/o timing parameters */
-    bs_dest->slice_start        = bs_src->slice_start;
-    bs_dest->slice_end          = bs_src->slice_end;
-    bs_dest->slice_submitted    = bs_src->slice_submitted;
-    bs_dest->io_limits          = bs_src->io_limits;
-    bs_dest->throttled_reqs     = bs_src->throttled_reqs;
-    bs_dest->block_timer        = bs_src->block_timer;
+    /* i/o throttled req */
+    memcpy(&bs_dest->throttle_state,
+           &bs_src->throttle_state,
+           sizeof(ThrottleState));
+    bs_dest->throttled_reqs[0]  = bs_src->throttled_reqs[0];
+    bs_dest->throttled_reqs[1]  = bs_src->throttled_reqs[1];
     bs_dest->io_limits_enabled  = bs_src->io_limits_enabled;
 
     /* r/w error */
@@ -1544,7 +1541,7 @@ void bdrv_swap(BlockDriverState *bs_new, BlockDriverState *bs_old)
     assert(bs_new->dev == NULL);
     assert(bs_new->in_use == 0);
     assert(bs_new->io_limits_enabled == false);
-    assert(bs_new->block_timer == NULL);
+    assert(!throttle_have_timer(&bs_new->throttle_state));
 
     tmp = *bs_new;
     *bs_new = *bs_old;
@@ -1563,7 +1560,7 @@ void bdrv_swap(BlockDriverState *bs_new, BlockDriverState *bs_old)
     assert(bs_new->job == NULL);
     assert(bs_new->in_use == 0);
     assert(bs_new->io_limits_enabled == false);
-    assert(bs_new->block_timer == NULL);
+    assert(!throttle_have_timer(&bs_new->throttle_state));
 
     bdrv_rebind(bs_new);
     bdrv_rebind(bs_old);
@@ -1854,8 +1851,15 @@ int bdrv_commit_all(void)
  *
  * This function should be called when a tracked request is completing.
  */
-static void tracked_request_end(BdrvTrackedRequest *req)
+static void tracked_request_end(BlockDriverState *bs,
+                                BdrvTrackedRequest *req,
+                                bool is_write)
 {
+    /* throttling disk I/O */
+    if (bs->io_limits_enabled) {
+        throttle_end(&bs->throttle_state, is_write);
+    }
+
     QLIST_REMOVE(req, list);
     qemu_co_queue_restart_all(&req->wait_queue);
 }
@@ -1868,6 +1872,11 @@ static void tracked_request_begin(BdrvTrackedRequest *req,
                                   int64_t sector_num,
                                   int nb_sectors, bool is_write)
 {
+    /* throttling disk I/O */
+    if (bs->io_limits_enabled) {
+        bdrv_io_limits_intercept(bs, is_write, nb_sectors);
+    }
+
     *req = (BdrvTrackedRequest){
         .bs = bs,
         .sector_num = sector_num,
@@ -2506,11 +2515,6 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
         return -EIO;
     }
 
-    /* throttling disk read I/O */
-    if (bs->io_limits_enabled) {
-        bdrv_io_limits_intercept(bs, false, nb_sectors);
-    }
-
     if (bs->copy_on_read) {
         flags |= BDRV_REQ_COPY_ON_READ;
     }
@@ -2541,7 +2545,7 @@ static int coroutine_fn bdrv_co_do_readv(BlockDriverState *bs,
     ret = drv->bdrv_co_readv(bs, sector_num, nb_sectors, qiov);
 
 out:
-    tracked_request_end(&req);
+    tracked_request_end(bs, &req, false);
 
     if (flags & BDRV_REQ_COPY_ON_READ) {
         bs->copy_on_read_in_flight--;
@@ -2619,11 +2623,6 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
         return -EIO;
     }
 
-    /* throttling disk write I/O */
-    if (bs->io_limits_enabled) {
-        bdrv_io_limits_intercept(bs, true, nb_sectors);
-    }
-
     if (bs->copy_on_read_in_flight) {
         wait_for_overlapping_requests(bs, sector_num, nb_sectors);
     }
@@ -2652,7 +2651,7 @@ static int coroutine_fn bdrv_co_do_writev(BlockDriverState *bs,
         bs->wr_highest_sector = sector_num + nb_sectors - 1;
     }
 
-    tracked_request_end(&req);
+    tracked_request_end(bs, &req, true);
 
     return ret;
 }
@@ -2745,14 +2744,6 @@ void bdrv_get_geometry(BlockDriverState *bs, uint64_t *nb_sectors_ptr)
     *nb_sectors_ptr = length;
 }
 
-/* throttling disk io limits */
-void bdrv_set_io_limits(BlockDriverState *bs,
-                        BlockIOLimit *io_limits)
-{
-    bs->io_limits = *io_limits;
-    bs->io_limits_enabled = bdrv_io_limits_enabled(bs);
-}
-
 void bdrv_set_on_error(BlockDriverState *bs, BlockdevOnError on_read_error,
                        BlockdevOnError on_write_error)
 {
@@ -3562,169 +3553,6 @@ void bdrv_aio_cancel(BlockDriverAIOCB *acb)
     acb->aiocb_info->cancel(acb);
 }
 
-/* block I/O throttling */
-static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
-                 bool is_write, double elapsed_time, uint64_t *wait)
-{
-    uint64_t bps_limit = 0;
-    uint64_t extension;
-    double   bytes_limit, bytes_base, bytes_res;
-    double   slice_time, wait_time;
-
-    if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
-        bps_limit = bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL];
-    } else if (bs->io_limits.bps[is_write]) {
-        bps_limit = bs->io_limits.bps[is_write];
-    } else {
-        if (wait) {
-            *wait = 0;
-        }
-
-        return false;
-    }
-
-    slice_time = bs->slice_end - bs->slice_start;
-    slice_time /= (NANOSECONDS_PER_SECOND);
-    bytes_limit = bps_limit * slice_time;
-    bytes_base  = bs->slice_submitted.bytes[is_write];
-    if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
-        bytes_base += bs->slice_submitted.bytes[!is_write];
-    }
-
-    /* bytes_base: the bytes of data which have been read/written; and
-     *             it is obtained from the history statistic info.
-     * bytes_res: the remaining bytes of data which need to be read/written.
-     * (bytes_base + bytes_res) / bps_limit: used to calcuate
-     *             the total time for completing reading/writting all data.
-     */
-    bytes_res   = (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
-
-    if (bytes_base + bytes_res <= bytes_limit) {
-        if (wait) {
-            *wait = 0;
-        }
-
-        return false;
-    }
-
-    /* Calc approx time to dispatch */
-    wait_time = (bytes_base + bytes_res) / bps_limit - elapsed_time;
-
-    /* When the I/O rate at runtime exceeds the limits,
-     * bs->slice_end need to be extended in order that the current statistic
-     * info can be kept until the timer fire, so it is increased and tuned
-     * based on the result of experiment.
-     */
-    extension = wait_time * NANOSECONDS_PER_SECOND;
-    extension = DIV_ROUND_UP(extension, BLOCK_IO_SLICE_TIME) *
-                BLOCK_IO_SLICE_TIME;
-    bs->slice_end += extension;
-    if (wait) {
-        *wait = wait_time * NANOSECONDS_PER_SECOND;
-    }
-
-    return true;
-}
-
-static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
-                             double elapsed_time, uint64_t *wait)
-{
-    uint64_t iops_limit = 0;
-    double   ios_limit, ios_base;
-    double   slice_time, wait_time;
-
-    if (bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
-        iops_limit = bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL];
-    } else if (bs->io_limits.iops[is_write]) {
-        iops_limit = bs->io_limits.iops[is_write];
-    } else {
-        if (wait) {
-            *wait = 0;
-        }
-
-        return false;
-    }
-
-    slice_time = bs->slice_end - bs->slice_start;
-    slice_time /= (NANOSECONDS_PER_SECOND);
-    ios_limit  = iops_limit * slice_time;
-    ios_base   = bs->slice_submitted.ios[is_write];
-    if (bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
-        ios_base += bs->slice_submitted.ios[!is_write];
-    }
-
-    if (ios_base + 1 <= ios_limit) {
-        if (wait) {
-            *wait = 0;
-        }
-
-        return false;
-    }
-
-    /* Calc approx time to dispatch, in seconds */
-    wait_time = (ios_base + 1) / iops_limit;
-    if (wait_time > elapsed_time) {
-        wait_time = wait_time - elapsed_time;
-    } else {
-        wait_time = 0;
-    }
-
-    /* Exceeded current slice, extend it by another slice time */
-    bs->slice_end += BLOCK_IO_SLICE_TIME;
-    if (wait) {
-        *wait = wait_time * NANOSECONDS_PER_SECOND;
-    }
-
-    return true;
-}
-
-static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
-                           bool is_write, int64_t *wait)
-{
-    int64_t  now, max_wait;
-    uint64_t bps_wait = 0, iops_wait = 0;
-    double   elapsed_time;
-    int      bps_ret, iops_ret;
-
-    now = qemu_get_clock_ns(vm_clock);
-    if (now > bs->slice_end) {
-        bs->slice_start = now;
-        bs->slice_end   = now + BLOCK_IO_SLICE_TIME;
-        memset(&bs->slice_submitted, 0, sizeof(bs->slice_submitted));
-    }
-
-    elapsed_time  = now - bs->slice_start;
-    elapsed_time  /= (NANOSECONDS_PER_SECOND);
-
-    bps_ret  = bdrv_exceed_bps_limits(bs, nb_sectors,
-                                      is_write, elapsed_time, &bps_wait);
-    iops_ret = bdrv_exceed_iops_limits(bs, is_write,
-                                      elapsed_time, &iops_wait);
-    if (bps_ret || iops_ret) {
-        max_wait = bps_wait > iops_wait ? bps_wait : iops_wait;
-        if (wait) {
-            *wait = max_wait;
-        }
-
-        now = qemu_get_clock_ns(vm_clock);
-        if (bs->slice_end < now + max_wait) {
-            bs->slice_end = now + max_wait;
-        }
-
-        return true;
-    }
-
-    if (wait) {
-        *wait = 0;
-    }
-
-    bs->slice_submitted.bytes[is_write] += (int64_t)nb_sectors *
-                                           BDRV_SECTOR_SIZE;
-    bs->slice_submitted.ios[is_write]++;
-
-    return false;
-}
-
 /**************************************************************/
 /* async block device emulation */
 
diff --git a/block/qapi.c b/block/qapi.c
index a4bc411..45f806b 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -223,18 +223,15 @@ void bdrv_query_info(BlockDriverState *bs,
         info->inserted->backing_file_depth = bdrv_get_backing_file_depth(bs);
 
         if (bs->io_limits_enabled) {
-            info->inserted->bps =
-                           bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL];
-            info->inserted->bps_rd =
-                           bs->io_limits.bps[BLOCK_IO_LIMIT_READ];
-            info->inserted->bps_wr =
-                           bs->io_limits.bps[BLOCK_IO_LIMIT_WRITE];
-            info->inserted->iops =
-                           bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL];
-            info->inserted->iops_rd =
-                           bs->io_limits.iops[BLOCK_IO_LIMIT_READ];
-            info->inserted->iops_wr =
-                           bs->io_limits.iops[BLOCK_IO_LIMIT_WRITE];
+            ThrottleConfig cfg;
+            throttle_get_config(&bs->throttle_state, &cfg);
+            info->inserted->bps     = cfg.buckets[THROTTLE_BPS_TOTAL].ups;
+            info->inserted->bps_rd  = cfg.buckets[THROTTLE_BPS_READ].ups;
+            info->inserted->bps_wr  = cfg.buckets[THROTTLE_BPS_WRITE].ups;
+
+            info->inserted->iops    = cfg.buckets[THROTTLE_OPS_TOTAL].ups;
+            info->inserted->iops_rd = cfg.buckets[THROTTLE_OPS_READ].ups;
+            info->inserted->iops_wr = cfg.buckets[THROTTLE_OPS_WRITE].ups;
         }
 
         bs0 = bs;
diff --git a/blockdev.c b/blockdev.c
index c5abd65..8eaa39f 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -280,32 +280,16 @@ static int parse_block_error_action(const char *buf, bool is_read)
     }
 }
 
-static bool do_check_io_limits(BlockIOLimit *io_limits, Error **errp)
+static bool do_check_ts(ThrottleConfig *cfg, Error **errp)
 {
-    bool bps_flag;
-    bool iops_flag;
-
-    assert(io_limits);
-
-    bps_flag  = (io_limits->bps[BLOCK_IO_LIMIT_TOTAL] != 0)
-                 && ((io_limits->bps[BLOCK_IO_LIMIT_READ] != 0)
-                 || (io_limits->bps[BLOCK_IO_LIMIT_WRITE] != 0));
-    iops_flag = (io_limits->iops[BLOCK_IO_LIMIT_TOTAL] != 0)
-                 && ((io_limits->iops[BLOCK_IO_LIMIT_READ] != 0)
-                 || (io_limits->iops[BLOCK_IO_LIMIT_WRITE] != 0));
-    if (bps_flag || iops_flag) {
-        error_setg(errp, "bps(iops) and bps_rd/bps_wr(iops_rd/iops_wr) "
+    if (throttle_conflicting(cfg)) {
+        error_setg(errp, "bps/iops/max total values and read/write values "
                          "cannot be used at the same time");
         return false;
     }
 
-    if (io_limits->bps[BLOCK_IO_LIMIT_TOTAL] < 0 ||
-        io_limits->bps[BLOCK_IO_LIMIT_WRITE] < 0 ||
-        io_limits->bps[BLOCK_IO_LIMIT_READ] < 0 ||
-        io_limits->iops[BLOCK_IO_LIMIT_TOTAL] < 0 ||
-        io_limits->iops[BLOCK_IO_LIMIT_WRITE] < 0 ||
-        io_limits->iops[BLOCK_IO_LIMIT_READ] < 0) {
-        error_setg(errp, "bps and iops values must be 0 or greater");
+    if (!throttle_is_valid(cfg)) {
+        error_setg(errp, "bps/iops/max values must be 0 or greater");
         return false;
     }
 
@@ -330,7 +314,7 @@ DriveInfo *drive_init(QemuOpts *all_opts, BlockInterfaceType block_default_type)
     int on_read_error, on_write_error;
     const char *devaddr;
     DriveInfo *dinfo;
-    BlockIOLimit io_limits;
+    ThrottleConfig cfg;
     int snapshot = 0;
     bool copy_on_read;
     int ret;
@@ -485,20 +469,32 @@ DriveInfo *drive_init(QemuOpts *all_opts, BlockInterfaceType block_default_type)
     }
 
     /* disk I/O throttling */
-    io_limits.bps[BLOCK_IO_LIMIT_TOTAL]  =
-                           qemu_opt_get_number(opts, "bps", 0);
-    io_limits.bps[BLOCK_IO_LIMIT_READ]   =
-                           qemu_opt_get_number(opts, "bps_rd", 0);
-    io_limits.bps[BLOCK_IO_LIMIT_WRITE]  =
-                           qemu_opt_get_number(opts, "bps_wr", 0);
-    io_limits.iops[BLOCK_IO_LIMIT_TOTAL] =
-                           qemu_opt_get_number(opts, "iops", 0);
-    io_limits.iops[BLOCK_IO_LIMIT_READ]  =
-                           qemu_opt_get_number(opts, "iops_rd", 0);
-    io_limits.iops[BLOCK_IO_LIMIT_WRITE] =
-                           qemu_opt_get_number(opts, "iops_wr", 0);
-
-    if (!do_check_io_limits(&io_limits, &error)) {
+    cfg.buckets[THROTTLE_BPS_TOTAL].ups =
+        qemu_opt_get_number(opts, "bps", 0);
+    cfg.buckets[THROTTLE_BPS_READ].ups  =
+        qemu_opt_get_number(opts, "bps_rd", 0);
+    cfg.buckets[THROTTLE_BPS_WRITE].ups =
+        qemu_opt_get_number(opts, "bps_wr", 0);
+
+    cfg.buckets[THROTTLE_OPS_TOTAL].ups =
+        qemu_opt_get_number(opts, "iops", 0);
+    cfg.buckets[THROTTLE_OPS_READ].ups =
+        qemu_opt_get_number(opts, "iops_rd", 0);
+    cfg.buckets[THROTTLE_OPS_WRITE].ups =
+        qemu_opt_get_number(opts, "iops_wr", 0);
+
+    cfg.buckets[THROTTLE_BPS_TOTAL].max = 0;
+    cfg.buckets[THROTTLE_BPS_READ].max  = 0;
+    cfg.buckets[THROTTLE_BPS_WRITE].max = 0;
+
+    cfg.buckets[THROTTLE_OPS_TOTAL].max = 0;
+    cfg.buckets[THROTTLE_OPS_READ].max  = 0;
+    cfg.buckets[THROTTLE_OPS_WRITE].max = 0;
+
+    cfg.unit_size = BDRV_SECTOR_SIZE;
+    cfg.op_size = 0;
+
+    if (!do_check_ts(&cfg, &error)) {
         error_report("%s", error_get_pretty(error));
         error_free(error);
         return NULL;
@@ -625,7 +621,10 @@ DriveInfo *drive_init(QemuOpts *all_opts, BlockInterfaceType block_default_type)
     bdrv_set_on_error(dinfo->bdrv, on_read_error, on_write_error);
 
     /* disk I/O throttling */
-    bdrv_set_io_limits(dinfo->bdrv, &io_limits);
+    if (throttle_enabled(&cfg)) {
+        bdrv_io_limits_enable(dinfo->bdrv);
+        bdrv_set_io_limits(dinfo->bdrv, &cfg);
+    }
 
     switch(type) {
     case IF_IDE:
@@ -1183,7 +1182,7 @@ void qmp_block_set_io_throttle(const char *device, int64_t bps, int64_t bps_rd,
                                int64_t bps_wr, int64_t iops, int64_t iops_rd,
                                int64_t iops_wr, Error **errp)
 {
-    BlockIOLimit io_limits;
+    ThrottleConfig cfg;
     BlockDriverState *bs;
 
     bs = bdrv_find(device);
@@ -1192,27 +1191,37 @@ void qmp_block_set_io_throttle(const char *device, int64_t bps, int64_t bps_rd,
         return;
     }
 
-    io_limits.bps[BLOCK_IO_LIMIT_TOTAL] = bps;
-    io_limits.bps[BLOCK_IO_LIMIT_READ]  = bps_rd;
-    io_limits.bps[BLOCK_IO_LIMIT_WRITE] = bps_wr;
-    io_limits.iops[BLOCK_IO_LIMIT_TOTAL]= iops;
-    io_limits.iops[BLOCK_IO_LIMIT_READ] = iops_rd;
-    io_limits.iops[BLOCK_IO_LIMIT_WRITE]= iops_wr;
+    cfg.buckets[THROTTLE_BPS_TOTAL].ups = bps;
+    cfg.buckets[THROTTLE_BPS_READ].ups  = bps_rd;
+    cfg.buckets[THROTTLE_BPS_WRITE].ups = bps_wr;
+
+    cfg.buckets[THROTTLE_OPS_TOTAL].ups = iops;
+    cfg.buckets[THROTTLE_OPS_READ].ups  = iops_rd;
+    cfg.buckets[THROTTLE_OPS_WRITE].ups = iops_wr;
+
+    cfg.buckets[THROTTLE_BPS_TOTAL].max = 0;
+    cfg.buckets[THROTTLE_BPS_READ].max  = 0;
+    cfg.buckets[THROTTLE_BPS_WRITE].max = 0;
+
+    cfg.buckets[THROTTLE_OPS_TOTAL].max = 0;
+    cfg.buckets[THROTTLE_OPS_READ].max  = 0;
+    cfg.buckets[THROTTLE_OPS_WRITE].max = 0;
+
+    cfg.unit_size = BDRV_SECTOR_SIZE;
+    cfg.op_size = 0;
 
-    if (!do_check_io_limits(&io_limits, errp)) {
+    if (!do_check_ts(&cfg, errp)) {
         return;
     }
 
-    bs->io_limits = io_limits;
-
-    if (!bs->io_limits_enabled && bdrv_io_limits_enabled(bs)) {
+    if (!bs->io_limits_enabled && throttle_enabled(&cfg)) {
         bdrv_io_limits_enable(bs);
-    } else if (bs->io_limits_enabled && !bdrv_io_limits_enabled(bs)) {
+    } else if (bs->io_limits_enabled && !throttle_enabled(&cfg)) {
         bdrv_io_limits_disable(bs);
-    } else {
-        if (bs->block_timer) {
-            qemu_mod_timer(bs->block_timer, qemu_get_clock_ns(vm_clock));
-        }
+    }
+
+    if (bs->io_limits_enabled) {
+        bdrv_set_io_limits(bs, &cfg);
     }
 }
 
diff --git a/include/block/block.h b/include/block/block.h
index 742fce5..b16d579 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -107,7 +107,6 @@ void bdrv_info_stats(Monitor *mon, QObject **ret_data);
 /* disk I/O throttling */
 void bdrv_io_limits_enable(BlockDriverState *bs);
 void bdrv_io_limits_disable(BlockDriverState *bs);
-bool bdrv_io_limits_enabled(BlockDriverState *bs);
 
 void bdrv_init(void);
 void bdrv_init_with_whitelist(void);
diff --git a/include/block/block_int.h b/include/block/block_int.h
index c6ac871..50cd66f 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -34,18 +34,12 @@
 #include "monitor/monitor.h"
 #include "qemu/hbitmap.h"
 #include "block/snapshot.h"
+#include "qemu/throttle.h"
 
 #define BLOCK_FLAG_ENCRYPT          1
 #define BLOCK_FLAG_COMPAT6          4
 #define BLOCK_FLAG_LAZY_REFCOUNTS   8
 
-#define BLOCK_IO_LIMIT_READ     0
-#define BLOCK_IO_LIMIT_WRITE    1
-#define BLOCK_IO_LIMIT_TOTAL    2
-
-#define BLOCK_IO_SLICE_TIME     100000000
-#define NANOSECONDS_PER_SECOND  1000000000.0
-
 #define BLOCK_OPT_SIZE              "size"
 #define BLOCK_OPT_ENCRYPT           "encryption"
 #define BLOCK_OPT_COMPAT6           "compat6"
@@ -69,17 +63,6 @@ typedef struct BdrvTrackedRequest {
     CoQueue wait_queue; /* coroutines blocked on this request */
 } BdrvTrackedRequest;
 
-
-typedef struct BlockIOLimit {
-    int64_t bps[3];
-    int64_t iops[3];
-} BlockIOLimit;
-
-typedef struct BlockIOBaseValue {
-    uint64_t bytes[2];
-    uint64_t ios[2];
-} BlockIOBaseValue;
-
 struct BlockDriver {
     const char *format_name;
     int instance_size;
@@ -263,13 +246,10 @@ struct BlockDriverState {
     /* number of in-flight copy-on-read requests */
     unsigned int copy_on_read_in_flight;
 
-    /* the time for latest disk I/O */
-    int64_t slice_start;
-    int64_t slice_end;
-    BlockIOLimit io_limits;
-    BlockIOBaseValue slice_submitted;
-    CoQueue      throttled_reqs;
-    QEMUTimer    *block_timer;
+    /* I/O throttling */
+    ThrottleState throttle_state;
+    /* two throttled request queues so reads don't starve writes or vice versa */
+    CoQueue      throttled_reqs[2];
     bool         io_limits_enabled;
 
     /* I/O stats (display with "info blockstats"). */
@@ -308,7 +288,8 @@ struct BlockDriverState {
 int get_tmp_filename(char *filename, int size);
 
 void bdrv_set_io_limits(BlockDriverState *bs,
-                        BlockIOLimit *io_limits);
+                        ThrottleConfig *cfg);
+
 
 /**
  * bdrv_add_before_write_notifier:
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [Qemu-devel] [RFC V3 0/2] continuous leaky bucket throttling
  2013-08-02 15:53 [Qemu-devel] [RFC V3 0/2] continuous leaky bucket throttling Benoît Canet
  2013-08-02 15:53 ` [Qemu-devel] [RFC V3 1/2] throttle: Add a new throttling API implementing continuus leaky bucket Benoît Canet
  2013-08-02 15:53 ` [Qemu-devel] [RFC V3 2/2] block: Enable the new throttling code in the block layer Benoît Canet
@ 2013-08-06  9:22 ` Fam Zheng
  2013-08-07  8:31 ` Stefan Hajnoczi
  3 siblings, 0 replies; 6+ messages in thread
From: Fam Zheng @ 2013-08-06  9:22 UTC (permalink / raw)
  To: Benoît Canet; +Cc: kwolf, pbonzini, qemu-devel, stefanha

On Fri, 08/02 17:53, Benoît Canet wrote:
> This patchset implement continous leaky bucket throttling.
> 
> It works mostly on the general case.
> The exception is where the load is composed of both reads and writes and two
> limits iops_rd and iops_wr are set.
> The resulting iops are a little above half of the given limits.
> I tried various strategies to avoid this: two timer, two throttled request
> queues or even a different algorithm using a priority queue.
> The problem is still the same in every version of the code: reads and writes
> operation seems entangled.
> 
> Benoît Canet (2):
>   throttle: Add a new throttling API implementing continuus leaky

s/continuus/continuous/

>     bucket.
>   block: Enable the new throttling code in the block layer.
> 
>  block.c                   |  316 ++++++++------------------------
>  block/qapi.c              |   21 +--
>  blockdev.c                |  115 ++++++------
>  include/block/block.h     |    1 -
>  include/block/block_int.h |   33 +---
>  include/qemu/throttle.h   |  111 ++++++++++++
>  util/Makefile.objs        |    1 +
>  util/throttle.c           |  436 +++++++++++++++++++++++++++++++++++++++++++++
>  8 files changed, 698 insertions(+), 336 deletions(-)
>  create mode 100644 include/qemu/throttle.h
>  create mode 100644 util/throttle.c
> 
> -- 
> 1.7.10.4
> 
> 

-- 
Fam


* Re: [Qemu-devel] [RFC V3 0/2] continuous leaky bucket throttling
  2013-08-02 15:53 [Qemu-devel] [RFC V3 0/2] continuous leaky bucket throttling Benoît Canet
                   ` (2 preceding siblings ...)
  2013-08-06  9:22 ` [Qemu-devel] [RFC V3 0/2] continuous leaky bucket throttling Fam Zheng
@ 2013-08-07  8:31 ` Stefan Hajnoczi
  2013-08-07 21:23   ` Benoît Canet
  3 siblings, 1 reply; 6+ messages in thread
From: Stefan Hajnoczi @ 2013-08-07  8:31 UTC (permalink / raw)
  To: Benoît Canet; +Cc: kwolf, pbonzini, qemu-devel

On Fri, Aug 02, 2013 at 05:53:00PM +0200, Benoît Canet wrote:
> This patchset implement continous leaky bucket throttling.
> 
> It works mostly on the general case.
> The exception is where the load is composed of both reads and writes and two
> limits iops_rd and iops_wr are set.
> The resulting iops are a little above half of the given limits.
> I tried various strategies to avoid this: two timer, two throttled request
> queues or even a different algorithm using a priority queue.
> The problem is still the same in every version of the code: reads and writes
> operation seems entangled.
> 
> Benoît Canet (2):
>   throttle: Add a new throttling API implementing continuus leaky
>     bucket.
>   block: Enable the new throttling code in the block layer.
> 
>  block.c                   |  316 ++++++++------------------------
>  block/qapi.c              |   21 +--
>  blockdev.c                |  115 ++++++------
>  include/block/block.h     |    1 -
>  include/block/block_int.h |   33 +---
>  include/qemu/throttle.h   |  111 ++++++++++++
>  util/Makefile.objs        |    1 +
>  util/throttle.c           |  436 +++++++++++++++++++++++++++++++++++++++++++++
>  8 files changed, 698 insertions(+), 336 deletions(-)
>  create mode 100644 include/qemu/throttle.h
>  create mode 100644 util/throttle.c

I saw more discussion on IRC.  Does this mean you will send another
revision to address outstanding issues?

Just wanted to check if you are waiting for code review or if you are
still developing the next patch revision.

Stefan


* Re: [Qemu-devel] [RFC V3 0/2] continuous leaky bucket throttling
  2013-08-07  8:31 ` Stefan Hajnoczi
@ 2013-08-07 21:23   ` Benoît Canet
  0 siblings, 0 replies; 6+ messages in thread
From: Benoît Canet @ 2013-08-07 21:23 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: kwolf, pbonzini, qemu-devel

> I saw more discussion on IRC.  Does this mean you will send another
> revision to address outstanding issues?
> 
> Just wanted to check if you are waiting for code review or if you are
> still developing the next patch revision.

I am currently finishing writing unit tests for the next patch revision.
Once the unit tests are done I will rebase the previous QMP throttling patches
on top of the new code, then post the new series.

I think that the result will be a net improvement.

Thanks for asking

Best regards

Benoît

