Date: Mon, 2 Sep 2013 11:08:32 +0800
From: Fam Zheng
Message-ID: <20130902030832.GA9925@T430s.nay.redhat.com>
References: <1378053587-12121-1-git-send-email-benoit@irqsave.net> <1378053587-12121-2-git-send-email-benoit@irqsave.net>
In-Reply-To: <1378053587-12121-2-git-send-email-benoit@irqsave.net>
Subject: Re: [Qemu-devel] [PATCH V11 1/5] throttle: Add a new throttling API implementing continuous leaky bucket.
Reply-To: famz@redhat.com
To: Benoît Canet
Cc: kwolf@redhat.com, stefanha@gmail.com, qemu-devel@nongnu.org, stefanha@redhat.com, pbonzini@redhat.com

On Sun, 09/01 18:39, Benoît Canet wrote:
> Implement the continuous leaky bucket algorithm devised on IRC as a separate
> module.
>
> Signed-off-by: Benoit Canet
> ---
>  include/qemu/throttle.h | 103 ++++++++++++
>  util/Makefile.objs | 1 +
>  util/throttle.c | 396 +++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 500 insertions(+)
>  create mode 100644 include/qemu/throttle.h
>  create mode 100644 util/throttle.c
>
> diff --git a/include/qemu/throttle.h b/include/qemu/throttle.h
> new file mode 100644
> index 0000000..823650d
> --- /dev/null
> +++ b/include/qemu/throttle.h
> @@ -0,0 +1,103 @@
> +/*
> + * QEMU throttling infrastructure
> + *
> + * Copyright (C) Nodalink, SARL. 2013
> + *
> + * Author:
> + *   Benoît Canet
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation; either version 2 or
> + * (at your option) version 3 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, see .
> + */
> +
> +#ifndef THROTTLE_H
> +#define THROTTLE_H
> +
> +#include
> +#include "qemu-common.h"
> +#include "qemu/timer.h"
> +
> +#define NANOSECONDS_PER_SECOND 1000000000.0
> +
> +typedef enum {
> +    THROTTLE_BPS_TOTAL,
> +    THROTTLE_BPS_READ,
> +    THROTTLE_BPS_WRITE,
> +    THROTTLE_OPS_TOTAL,
> +    THROTTLE_OPS_READ,
> +    THROTTLE_OPS_WRITE,
> +    BUCKETS_COUNT,
> +} BucketType;
> +
> +typedef struct LeakyBucket {
> +    double avg; /* average goal in units per second */
> +    double max; /* leaky bucket max burst in units */
> +    double level; /* bucket level in units */
> +} LeakyBucket;
> +
> +/* The following structure is used to configure a ThrottleState
> + * It contains a bit of state: the bucket field of the LeakyBucket structure.
> + * However it allows to keep the code clean and the bucket field is reset to
> + * zero at the right time.
> + */
> +typedef struct ThrottleConfig {
> +    LeakyBucket buckets[BUCKETS_COUNT]; /* leaky buckets */
> +    uint64_t op_size; /* size of an operation in bytes */
> +} ThrottleConfig;
> +
> +typedef struct ThrottleState {
> +    ThrottleConfig cfg; /* configuration */
> +    int64_t previous_leak; /* timestamp of the last leak done */
> +    QEMUTimer * timers[2]; /* timers used to do the throttling */
> +    QEMUClockType clock_type; /* the clock used */
> +} ThrottleState;
> +
> +/* operations on single leaky buckets */
> +void throttle_leak_bucket(LeakyBucket *bkt, int64_t delta);
> +
> +int64_t throttle_compute_wait(LeakyBucket *bkt);
> +
> +/* expose timer computation function for unit tests */
> +bool throttle_compute_timer(ThrottleState *ts,
> +                            bool is_write,
> +                            int64_t now,
> +                            int64_t *next_timestamp);
> +
> +/* init/destroy cycle */
> +void throttle_init(ThrottleState *ts,
> +                   QEMUClockType clock_type,
> +                   void (read_timer)(void *),
> +                   void (write_timer)(void *),
> +                   void *timer_opaque);
> +
> +void throttle_destroy(ThrottleState *ts);
> +
> +bool throttle_have_timer(ThrottleState *ts);
> +
> +/* configuration */
> +bool
throttle_enabled(ThrottleConfig *cfg);
> +
> +bool throttle_conflicting(ThrottleConfig *cfg);
> +
> +bool throttle_is_valid(ThrottleConfig *cfg);
> +
> +void throttle_config(ThrottleState *ts, ThrottleConfig *cfg);
> +
> +void throttle_get_config(ThrottleState *ts, ThrottleConfig *cfg);
> +
> +/* usage */
> +bool throttle_schedule_timer(ThrottleState *ts, bool is_write);
> +
> +void throttle_account(ThrottleState *ts, bool is_write, uint64_t size);
> +
> +#endif
> diff --git a/util/Makefile.objs b/util/Makefile.objs
> index dc72ab0..2bb13a2 100644
> --- a/util/Makefile.objs
> +++ b/util/Makefile.objs
> @@ -11,3 +11,4 @@ util-obj-y += iov.o aes.o qemu-config.o qemu-sockets.o uri.o notify.o
>  util-obj-y += qemu-option.o qemu-progress.o
>  util-obj-y += hexdump.o
>  util-obj-y += crc32c.o
> +util-obj-y += throttle.o
> diff --git a/util/throttle.c b/util/throttle.c
> new file mode 100644
> index 0000000..cf048b9
> --- /dev/null
> +++ b/util/throttle.c
> @@ -0,0 +1,396 @@
> +/*
> + * QEMU throttling infrastructure
> + *
> + * Copyright (C) Nodalink, SARL. 2013
> + *
> + * Author:
> + *   Benoît Canet
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation; either version 2 or
> + * (at your option) version 3 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, see .
> + */
> +
> +#include "qemu/throttle.h"
> +#include "qemu/timer.h"
> +
> +/* This function make a bucket leak
> + *
> + * @bkt: the bucket to make leak
> + * @delta_ns: the time delta
> + */
> +void throttle_leak_bucket(LeakyBucket *bkt, int64_t delta_ns)
> +{
> +    double leak;
> +
> +    /* compute how much to leak */
> +    leak = (bkt->avg * (double) delta_ns) / NANOSECONDS_PER_SECOND;
> +
> +    /* make the bucket leak */
> +    bkt->level = MAX(bkt->level - leak, 0);
> +}
> +
> +/* Calculate the time delta since last leak and make proportionals leaks
> + *
> + * @now: the current timestamp in ns
> + */
> +static void throttle_do_leak(ThrottleState *ts, int64_t now)
> +{
> +    /* compute the time elapsed since the last leak */
> +    int64_t delta_ns = now - ts->previous_leak;
> +    int i;
> +
> +    ts->previous_leak = now;
> +
> +    if (delta_ns <= 0) {
> +        return;
> +    }
> +
> +    /* make each bucket leak */
> +    for (i = 0; i < BUCKETS_COUNT; i++) {
> +        throttle_leak_bucket(&ts->cfg.buckets[i], delta_ns);
> +    }
> +}
> +
> +/* do the real job of computing the time to wait
> + *
> + * @limit: the throttling limit
> + * @extra: the number of operation to delay
> + * @ret: the time to wait in ns
> + */
> +static int64_t throttle_do_compute_wait(double limit, double extra)
> +{
> +    double wait = extra * NANOSECONDS_PER_SECOND;
> +    wait /= limit;
> +    return wait;
> +}
> +
> +/* This function compute the wait time in ns that a leaky bucket should trigger
> + *
> + * @bkt: the leaky bucket we operate on
> + * @ret: the resulting wait time in ns or 0 if the operation can go through
> + */
> +int64_t throttle_compute_wait(LeakyBucket *bkt)
> +{
> +    double extra; /* the number of extra units blocking the io */
> +
> +    if (!bkt->avg) {
> +        return 0;
> +    }
> +
> +    extra = bkt->level - bkt->max;
> +
> +    if (extra <= 0) {
> +        return 0;
> +    }
> +
> +    return throttle_do_compute_wait(bkt->avg, extra);
> +}
> +
> +/* This function compute the time that must be waited while this
IO
> + *
> + * @is_write: true if the current IO is a write, false if it's a read
> + * @ret: time to wait
> + */
> +static int64_t throttle_compute_wait_for(ThrottleState *ts,
> +                                         bool is_write)
> +{
> +    BucketType to_check[2][4] = { {THROTTLE_BPS_TOTAL,
> +                                   THROTTLE_OPS_TOTAL,
> +                                   THROTTLE_BPS_READ,
> +                                   THROTTLE_OPS_READ},
> +                                  {THROTTLE_BPS_TOTAL,
> +                                   THROTTLE_OPS_TOTAL,
> +                                   THROTTLE_BPS_WRITE,
> +                                   THROTTLE_OPS_WRITE}, };
> +    int64_t wait, max_wait = 0;
> +    int i;
> +
> +    for (i = 0; i < 4; i++) {
> +        BucketType index = to_check[is_write][i];
> +        wait = throttle_compute_wait(&ts->cfg.buckets[index]);
> +        if (wait > max_wait) {
> +            max_wait = wait;
> +        }
> +    }
> +
> +    return max_wait;
> +}
> +
> +/* compute the timer for this type of operation
> + *
> + * @is_write: the type of operation
> + * @now: the current clock timestamp
> + * @next_timestamp: the resulting timer
> + * @ret: true if a timer must be set
> + */
> +bool throttle_compute_timer(ThrottleState *ts,
> +                            bool is_write,
> +                            int64_t now,
> +                            int64_t *next_timestamp)
> +{
> +    int64_t wait;
> +
> +    /* leak proportionally to the time elapsed */
> +    throttle_do_leak(ts, now);
> +
> +    /* compute the wait time if any */
> +    wait = throttle_compute_wait_for(ts, is_write);
> +
> +    /* if the code must wait compute when the next timer should fire */
> +    if (wait) {
> +        *next_timestamp = now + wait;
> +        return true;
> +    }
> +
> +    /* else no need to wait at all */
> +    *next_timestamp = now;
> +    return false;
> +}
> +
> +/* To be called first on the ThrottleState */
> +void throttle_init(ThrottleState *ts,
> +                   QEMUClockType clock_type,
> +                   QEMUTimerCB *read_timer_cb,
> +                   QEMUTimerCB *write_timer_cb,
> +                   void *timer_opaque)
> +{
> +    memset(ts, 0, sizeof(ThrottleState));
> +
> +    ts->clock_type = clock_type;
> +    ts->timers[0] = timer_new_ns(clock_type, read_timer_cb, timer_opaque);
> +    ts->timers[1] = timer_new_ns(clock_type, write_timer_cb, timer_opaque);
> +}
> +
> +/* destroy a timer
*/
> +static void throttle_timer_destroy(QEMUTimer **timer)
> +{
> +    assert(*timer != NULL);
> +
> +    timer_del(*timer);
> +    timer_free(*timer);
> +    *timer = NULL;
> +}
> +
> +/* To be called last on the ThrottleState */
> +void throttle_destroy(ThrottleState *ts)
> +{
> +    int i;
> +
> +    for (i = 0; i < 2; i++) {
> +        throttle_timer_destroy(&ts->timers[i]);
> +    }
> +}
> +
> +/* is any throttling timer configured */
> +bool throttle_have_timer(ThrottleState *ts)
> +{
> +    if (ts->timers[0]) {
> +        return true;
> +    }
> +
> +    return false;
> +}
> +
> +/* Does any throttling must be done
> + *
> + * @cfg: the throttling configuration to inspect
> + * @ret: true if throttling must be done else false
> + */
> +bool throttle_enabled(ThrottleConfig *cfg)
> +{
> +    int i;
> +
> +    for (i = 0; i < BUCKETS_COUNT; i++) {
> +        if (cfg->buckets[i].avg > 0) {
> +            return true;
> +        }
> +    }
> +
> +    return false;
> +}
> +
> +/* return true if any two throttling parameters conflicts
> + *
> + * @cfg: the throttling configuration to inspect
> + * @ret: true if any conflict detected else false
> + */
> +bool throttle_conflicting(ThrottleConfig *cfg)
> +{
> +    bool bps_flag, ops_flag;
> +    bool bps_max_flag, ops_max_flag;
> +
> +    bps_flag = cfg->buckets[THROTTLE_BPS_TOTAL].avg &&
> +               (cfg->buckets[THROTTLE_BPS_READ].avg ||
> +                cfg->buckets[THROTTLE_BPS_WRITE].avg);
> +
> +    ops_flag = cfg->buckets[THROTTLE_OPS_TOTAL].avg &&
> +               (cfg->buckets[THROTTLE_OPS_READ].avg ||
> +                cfg->buckets[THROTTLE_OPS_WRITE].avg);
> +
> +    bps_max_flag = cfg->buckets[THROTTLE_BPS_TOTAL].max &&
> +                   (cfg->buckets[THROTTLE_BPS_READ].max ||
> +                    cfg->buckets[THROTTLE_BPS_WRITE].max);
> +
> +    ops_max_flag = cfg->buckets[THROTTLE_OPS_TOTAL].max &&
> +                   (cfg->buckets[THROTTLE_OPS_READ].max ||
> +                    cfg->buckets[THROTTLE_OPS_WRITE].max);
> +
> +    return bps_flag || ops_flag || bps_max_flag || ops_max_flag;
> +}
> +
> +/* check if a throttling configuration is valid
> + * @cfg: the throttling configuration to inspect
> + * @ret: true if valid else false
> + */
> +bool throttle_is_valid(ThrottleConfig *cfg)
> +{
> +    bool invalid = false;
> +    int i;
> +
> +    for (i = 0; i < BUCKETS_COUNT; i++) {
> +        if (cfg->buckets[i].avg < 0) {
> +            invalid = true;
> +        }
> +    }
> +
> +    for (i = 0; i < BUCKETS_COUNT; i++) {
> +        if (cfg->buckets[i].max < 0) {
> +            invalid = true;
> +        }
> +    }
> +
> +    return !invalid;
> +}
> +
> +/* fix bucket parameters */
> +static void throttle_fix_bucket(LeakyBucket *bkt)
> +{
> +    double min;
> +
> +    /* zero bucket level */
> +    bkt->level = 0;
> +
> +    /* The following is done to cope with the Linux CFQ block scheduler
> +     * which regroup reads and writes by block of 100ms in the guest.
> +     * When they are two process one making reads and one making writes cfq
> +     * make a pattern looking like the following:
> +     * WWWWWWWWWWWRRRRRRRRRRRRRRWWWWWWWWWWWWWwRRRRRRRRRRRRRRRRR
> +     * Having a max burst value of 100ms of the average will help smooth the
> +     * throttling
> +     */
> +    min = bkt->avg / 10;
> +    if (bkt->avg && !bkt->max) {
> +        bkt->max = min;
> +    }
> +}
> +
> +/* take care of canceling a timer */
> +static void throttle_cancel_timer(QEMUTimer *timer)
> +{
> +    assert(timer != NULL);
> +
> +    timer_del(timer);
> +}
> +
> +/* Used to configure the throttle
> + *
> + * @ts: the throttle state we are working on
> + * @cfg: the config to set
> + */
> +void throttle_config(ThrottleState *ts, ThrottleConfig *cfg)
> +{
> +    int i;
> +
> +    ts->cfg = *cfg;
> +
> +    for (i = 0; i < BUCKETS_COUNT; i++) {
> +        throttle_fix_bucket(&ts->cfg.buckets[i]);
> +    }
> +
> +    ts->previous_leak = qemu_clock_get_ns(ts->clock_type);
> +
> +    for (i = 0; i < 2; i++) {
> +        throttle_cancel_timer(ts->timers[i]);
> +    }
> +}
> +
> +/* used to get config
> + *
> + * @ts: the throttle state we are working on
> + * @cfg: the config to write
> + */
> +void throttle_get_config(ThrottleState *ts, ThrottleConfig *cfg)
> +{
> +    *cfg = ts->cfg;
> +}
> +
> +
> +/* Schedule the
read or write timer if needed
> + *
> + * NOTE: this function is not unit tested due to it's usage of timer_mod
> + *
> + * @is_write: the type of operation (read/write)
> + * @ret: true if the timer has been scheduled else false
> + */
> +bool throttle_schedule_timer(ThrottleState *ts, bool is_write)
> +{
> +    int64_t now = qemu_clock_get_ns(ts->clock_type);
> +    int64_t next_timestamp;
> +    bool must_wait;
> +
> +    must_wait = throttle_compute_timer(ts,
> +                                       is_write,
> +                                       now,
> +                                       &next_timestamp);
> +
> +    /* request not throttled */
> +    if (!must_wait) {
> +        return false;
> +    }
> +
> +    /* request throttled and timer pending -> do nothing */
> +    if (timer_pending(ts->timers[is_write])) {
> +        return true;
> +    }
> +
> +    /* request throttled and timer not pending -> arm timer */
> +    timer_mod(ts->timers[is_write], next_timestamp);
> +    return true;
> +}
> +
> +/* do the accounting for this operation
> + *
> + * @is_write: the type of operation (read/write)
> + * @size: the size of the operation
> + */
> +void throttle_account(ThrottleState *ts, bool is_write, uint64_t size)
> +{
> +    double units = 1.0;
> +
> +    /* if cfg.op_size is not defined we will account exactly 1 operation */
> +    if (ts->cfg.op_size) {
> +        units = (double) size / ts->cfg.op_size;
> +    }

If op_size is non-zero, every request is accounted as size / op_size
operations, so the iops limits become merely a fixed proportion of the bps
limits. That means only the lower set of the two is applied and the higher
one is skipped.

I understand that Amazon uses op_size-like accounting for big IO requests,
but we don't do it conditionally on IO size or anything here. So once the
user sets op_size, it simply kicks either bps_{,rd,wr} or iops_{,rd,wr} out
of the game. Is that true?
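
To put numbers on the question, here is a minimal sketch of the arithmetic,
assuming the accounting quoted above (units = size / op_size for every
request) and ignoring the burst value (max). The helpers ops_limit_in_bytes()
and effective_bps() are hypothetical names used only for illustration; they
are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not in the patch): the byte rate at which an ops
 * bucket starts throttling when each request is accounted as
 * size / op_size operations. Only meaningful for a non-zero op_size. */
static double ops_limit_in_bytes(double iops_avg, uint64_t op_size)
{
    assert(op_size != 0);
    return iops_avg * (double) op_size;
}

/* Hypothetical helper (not in the patch): the byte rate effectively
 * enforced when a bps limit and an iops limit are both set together
 * with op_size. The lower of the two equivalent byte rates wins. */
static double effective_bps(double bps_avg, double iops_avg, uint64_t op_size)
{
    double from_ops = ops_limit_in_bytes(iops_avg, op_size);

    return bps_avg < from_ops ? bps_avg : from_ops;
}
```

With op_size = 4096, an iops limit of 100 corresponds to 409600 bytes/s, so
under these assumptions a bps limit of 1048576 would never be the one that
throttles, while a bps limit of 1000 would make the iops limit irrelevant
instead.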
Fam

> +
> +    ts->cfg.buckets[THROTTLE_BPS_TOTAL].level += size;
> +    ts->cfg.buckets[THROTTLE_OPS_TOTAL].level += units;
> +
> +    if (is_write) {
> +        ts->cfg.buckets[THROTTLE_BPS_WRITE].level += size;
> +        ts->cfg.buckets[THROTTLE_OPS_WRITE].level += units;
> +    } else {
> +        ts->cfg.buckets[THROTTLE_BPS_READ].level += size;
> +        ts->cfg.buckets[THROTTLE_OPS_READ].level += units;
> +    }
> +}
> +
> -- 
> 1.7.10.4
>
>