Date: Mon, 5 Sep 2011 15:10:58 +0800
From: Zhi Yong Wu
Message-ID: <20110905071058.GI19143@f15.cn.ibm.com>
References: <1314877456-19521-1-git-send-email-wuzhy@linux.vnet.ibm.com> <1314877456-19521-4-git-send-email-wuzhy@linux.vnet.ibm.com>
Subject: Re: [Qemu-devel] [PATCH v6 3/4] block: add block timer and block throttling algorithm
To: Stefan Hajnoczi
Cc: qemu-devel@nongnu.org

On Thu, Sep 01, 2011 at 02:36:41PM +0100, Stefan Hajnoczi wrote:
>On Thu, Sep 1, 2011 at 12:44 PM, Zhi Yong Wu wrote:
>> Note:
>>     1.) When the bps/iops limits are set to a small value such as 511 bytes/s, the VM will hang up. We are considering how to handle this scenario.
>>     2.) When the "dd" command is issued in the guest with its bs option set to a large value such as "bs=1024K", the resulting speed will be slightly higher than the limits.
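
A note on 1.) above: the hang looks inherent in the per-slice budget check.
A minimal standalone sketch of the arithmetic (BLOCK_IO_SLICE_TIME is assumed
to be 100 ms here; the names and numbers are illustrative, not lifted from the
patch):

#include <stdio.h>

#define BDRV_SECTOR_SIZE   512
#define SLICE_TIME_SECONDS 0.1    /* assumed BLOCK_IO_SLICE_TIME */

int main(void)
{
    double bps_limit   = 511.0;                  /* bytes/s */
    double bytes_limit = bps_limit * SLICE_TIME_SECONDS;
    double bytes_res   = 1 * BDRV_SECTOR_SIZE;   /* one 512-byte sector */

    /* The patch dispatches only while bytes_disp + bytes_res <= bytes_limit.
     * One sector (512 bytes) always exceeds the whole slice budget
     * (511 * 0.1 = 51.1 bytes), so the request is re-queued every time the
     * block timer fires and the guest never makes progress. */
    printf("slice budget = %.1f bytes, request = %.0f bytes -> never fits\n",
           bytes_limit, bytes_res);
    return 0;
}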
>>
>> For these problems, if you have any good ideas, please let us know. :)
>>
>> Signed-off-by: Zhi Yong Wu
>> ---
>>  block.c     |  290 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
>>  block.h     |    5 +
>>  block_int.h |    9 ++
>>  3 files changed, 296 insertions(+), 8 deletions(-)
>>
>> diff --git a/block.c b/block.c
>> index 17ee3df..680f1e7 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -30,6 +30,9 @@
>>  #include "qemu-objects.h"
>>  #include "qemu-coroutine.h"
>>
>> +#include "qemu-timer.h"
>> +#include "block/blk-queue.h"
>> +
>>  #ifdef CONFIG_BSD
>>  #include <sys/types.h>
>>  #include <sys/stat.h>
>> @@ -72,6 +75,13 @@ static int coroutine_fn bdrv_co_writev_em(BlockDriverState *bs,
>>                                          QEMUIOVector *iov);
>>  static int coroutine_fn bdrv_co_flush_em(BlockDriverState *bs);
>>
>> +static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
>> +        bool is_write, double elapsed_time, uint64_t *wait);
>> +static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
>> +        double elapsed_time, uint64_t *wait);
>> +static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
>> +        bool is_write, uint64_t *wait);
>> +
>>  static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
>>      QTAILQ_HEAD_INITIALIZER(bdrv_states);
>>
>> @@ -104,6 +114,64 @@ int is_windows_drive(const char *filename)
>>  }
>>  #endif
>>
>> +/* throttling disk I/O limits */
>> +void bdrv_io_limits_disable(BlockDriverState *bs)
>> +{
>> +    bs->io_limits_enabled = false;
>> +
>> +    if (bs->block_queue) {
>> +        qemu_block_queue_flush(bs->block_queue);
>> +        qemu_del_block_queue(bs->block_queue);
>> +        bs->block_queue = NULL;
>> +    }
>> +
>> +    if (bs->block_timer) {
>> +        qemu_del_timer(bs->block_timer);
>> +        qemu_free_timer(bs->block_timer);
>> +        bs->block_timer = NULL;
>> +    }
>> +
>> +    bs->slice_start[BLOCK_IO_LIMIT_READ]  = 0;
>> +    bs->slice_start[BLOCK_IO_LIMIT_WRITE] = 0;
>> +
>> +    bs->slice_end[BLOCK_IO_LIMIT_READ]    = 0;
>> +    bs->slice_end[BLOCK_IO_LIMIT_WRITE]   = 0;
>> +}
>> +
>> +static void bdrv_block_timer(void *opaque)
>> +{
>> +    BlockDriverState *bs = opaque;
>> +    BlockQueue *queue = bs->block_queue;
>> +
>> +    qemu_block_queue_flush(queue);
>> +}
>> +
>> +void bdrv_io_limits_enable(BlockDriverState *bs)
>> +{
>> +    bs->block_queue = qemu_new_block_queue();
>> +    bs->block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
>> +
>> +    bs->slice_start[BLOCK_IO_LIMIT_READ]  = qemu_get_clock_ns(vm_clock);
>> +    bs->slice_start[BLOCK_IO_LIMIT_WRITE] =
>> +        bs->slice_start[BLOCK_IO_LIMIT_READ];
>> +
>> +    bs->slice_end[BLOCK_IO_LIMIT_READ]    =
>> +        bs->slice_start[BLOCK_IO_LIMIT_READ] + BLOCK_IO_SLICE_TIME;
>> +    bs->slice_end[BLOCK_IO_LIMIT_WRITE]   =
>> +        bs->slice_end[BLOCK_IO_LIMIT_READ];
>> +}
>> +
>> +bool bdrv_io_limits_enabled(BlockDriverState *bs)
>> +{
>> +    BlockIOLimit *io_limits = &bs->io_limits;
>> +    return io_limits->bps[BLOCK_IO_LIMIT_READ]
>> +         || io_limits->bps[BLOCK_IO_LIMIT_WRITE]
>> +         || io_limits->bps[BLOCK_IO_LIMIT_TOTAL]
>> +         || io_limits->iops[BLOCK_IO_LIMIT_READ]
>> +         || io_limits->iops[BLOCK_IO_LIMIT_WRITE]
>> +         || io_limits->iops[BLOCK_IO_LIMIT_TOTAL];
>> +}
>> +
>>  /* check if the path starts with "<protocol>:" */
>>  static int path_has_protocol(const char *path)
>>  {
>> @@ -694,6 +762,11 @@ int bdrv_open(BlockDriverState *bs, const char *filename, int flags,
>>             bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
>>     }
>>
>> +    /* throttling disk I/O limits */
>> +    if (bs->io_limits_enabled) {
>> +        bdrv_io_limits_enable(bs);
>> +    }
>> +
>>     return 0;
>>
>>  unlink_and_fail:
>> @@ -732,6 +805,18 @@ void bdrv_close(BlockDriverState *bs)
>>         if (bs->change_cb)
>>             bs->change_cb(bs->change_opaque, CHANGE_MEDIA);
>>     }
>> +
>> +    /* throttling disk I/O limits */
>> +    if (bs->block_queue) {
>> +        qemu_del_block_queue(bs->block_queue);
>> +        bs->block_queue = NULL;
>> +    }
>> +
>> +    if (bs->block_timer) {
>> +        qemu_del_timer(bs->block_timer);
>> +        qemu_free_timer(bs->block_timer);
>> +        bs->block_timer = NULL;
>> +    }
>>  }
>>
>>  void bdrv_close_all(void)
>> @@ -2290,13 +2375,29 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
>>                                  BlockDriverCompletionFunc *cb, void *opaque)
>>  {
>>     BlockDriver *drv = bs->drv;
>> +    uint64_t wait_time = 0;
>> +    BlockDriverAIOCB *ret;
>>
>>     trace_bdrv_aio_readv(bs, sector_num, nb_sectors, opaque);
>>
>> -    if (!drv)
>> -        return NULL;
>> -    if (bdrv_check_request(bs, sector_num, nb_sectors))
>> +    if (!drv || bdrv_check_request(bs, sector_num, nb_sectors)) {
>>         return NULL;
>> +    }
>> +
>> +    /* throttling disk read I/O */
>> +    if (bs->io_limits_enabled) {
>> +        if (bdrv_exceed_io_limits(bs, nb_sectors, false, &wait_time)) {
>> +            ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_readv,
>> +                          sector_num, qiov, nb_sectors, cb, opaque);
>> +            qemu_mod_timer(bs->block_timer,
>> +                          wait_time + qemu_get_clock_ns(vm_clock));
>> +            return ret;
>> +        }
>> +
>> +        bs->io_disps.bytes[BLOCK_IO_LIMIT_READ] +=
>> +                          (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
>> +        bs->io_disps.ios[BLOCK_IO_LIMIT_READ]++;
>> +    }
>>
>>     return drv->bdrv_aio_readv(bs, sector_num, qiov, nb_sectors,
>>                                cb, opaque);
>> @@ -2345,15 +2446,14 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
>>     BlockDriver *drv = bs->drv;
>>     BlockDriverAIOCB *ret;
>>     BlockCompleteData *blk_cb_data;
>> +    uint64_t wait_time = 0;
>>
>>     trace_bdrv_aio_writev(bs, sector_num, nb_sectors, opaque);
>>
>> -    if (!drv)
>> -        return NULL;
>> -    if (bs->read_only)
>> -        return NULL;
>> -    if (bdrv_check_request(bs, sector_num, nb_sectors))
>> +    if (!drv || bs->read_only
>> +        || bdrv_check_request(bs, sector_num, nb_sectors)) {
>>         return NULL;
>> +    }
>>
>>     if (bs->dirty_bitmap) {
>>         blk_cb_data = blk_dirty_cb_alloc(bs, sector_num, nb_sectors, cb,
>> @@ -2362,6 +2462,17 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
>>         opaque = blk_cb_data;
>>     }
>>
>> +    /* throttling disk write I/O */
>> +    if (bs->io_limits_enabled) {
>> +        if (bdrv_exceed_io_limits(bs, nb_sectors, true, &wait_time)) {
>> +            ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_writev,
>> +                          sector_num, qiov, nb_sectors, cb, opaque);
>> +            qemu_mod_timer(bs->block_timer,
>> +                          wait_time + qemu_get_clock_ns(vm_clock));
>> +            return ret;
>> +        }
>> +    }
>> +
>>     ret = drv->bdrv_aio_writev(bs, sector_num, qiov, nb_sectors,
>>                                cb, opaque);
>>
>> @@ -2369,6 +2480,12 @@ BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
>>         if (bs->wr_highest_sector < sector_num + nb_sectors - 1) {
>>             bs->wr_highest_sector = sector_num + nb_sectors - 1;
>>         }
>> +
>> +        if (bs->io_limits_enabled) {
>> +            bs->io_disps.bytes[BLOCK_IO_LIMIT_WRITE] +=
>> +                          (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
>> +            bs->io_disps.ios[BLOCK_IO_LIMIT_WRITE]++;
>> +        }
>>     }
>>
>>     return ret;
>> @@ -2633,6 +2750,163 @@ void bdrv_aio_cancel(BlockDriverAIOCB *acb)
>>     acb->pool->cancel(acb);
>>  }
>>
>> +static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
>> +                 bool is_write, double elapsed_time, uint64_t *wait) {
>> +    uint64_t bps_limit = 0;
>> +    double   bytes_limit, bytes_disp, bytes_res;
>> +    double   slice_time, wait_time;
>> +
>> +    if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
>> +        bps_limit = bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL];
>> +    } else if (bs->io_limits.bps[is_write]) {
>> +        bps_limit = bs->io_limits.bps[is_write];
>> +    } else {
>> +        if (wait) {
>> +            *wait = 0;
>> +        }
>> +
>> +        return false;
>> +    }
>> +
>> +    slice_time = bs->slice_end[is_write] - bs->slice_start[is_write];
>> +    slice_time /= (NANOSECONDS_PER_SECOND);
>> +    bytes_limit = bps_limit * slice_time;
>> +    bytes_disp  = bs->io_disps.bytes[is_write];
>> +    if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
>> +        bytes_disp += bs->io_disps.bytes[!is_write];
>> +    }
>> +
>> +    bytes_res   = (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
>> +
>> +    if (bytes_disp + bytes_res <= bytes_limit) {
>> +        if (wait) {
>> +            *wait = 0;
>> +        }
>> +
>> +        return false;
>> +    }
>> +
>> +    /* Calc approx time to dispatch */
>> +    wait_time = (bytes_disp + bytes_res) / bps_limit - elapsed_time;
>> +
>> +    if (wait) {
>> +        *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
>> +                                    double elapsed_time, uint64_t *wait) {
>> +    uint64_t iops_limit = 0;
>> +    double   ios_limit, ios_disp;
>> +    double   slice_time, wait_time;
>> +
>> +    if (bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
>> +        iops_limit = bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL];
>> +    } else if (bs->io_limits.iops[is_write]) {
>> +        iops_limit = bs->io_limits.iops[is_write];
>> +    } else {
>> +        if (wait) {
>> +            *wait = 0;
>> +        }
>> +
>> +        return false;
>> +    }
>> +
>> +    slice_time = bs->slice_end[is_write] - bs->slice_start[is_write];
>> +    slice_time /= (NANOSECONDS_PER_SECOND);
>> +    ios_limit  = iops_limit * slice_time;
>> +    ios_disp   = bs->io_disps.ios[is_write];
>> +    if (bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
>> +        ios_disp += bs->io_disps.ios[!is_write];
>> +    }
>> +
>> +    if (ios_disp + 1 <= ios_limit) {
>> +        if (wait) {
>> +            *wait = 0;
>> +        }
>> +
>> +        return false;
>> +    }
>> +
>> +    /* Calc approx time to dispatch */
>> +    wait_time = (ios_disp + 1) / iops_limit;
>> +    if (wait_time > elapsed_time) {
>> +        wait_time = wait_time - elapsed_time;
>> +    } else {
>> +        wait_time = 0;
>> +    }
>> +
>> +    if (wait) {
>> +        *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
>> +                                  bool is_write, uint64_t *wait) {
>> +    int64_t  now;
>> +    uint64_t bps_wait = 0, iops_wait = 0, max_wait;
>> +    double   elapsed_time;
>> +    int      bps_ret, iops_ret;
>> +
>> +    now = qemu_get_clock_ns(vm_clock);
>> +    if ((bs->slice_start[is_write] < now)
>> +        && (bs->slice_end[is_write] > now)) {
>> +        bs->slice_end[is_write]   = now + BLOCK_IO_SLICE_TIME;
>> +    } else {
>> +        bs->slice_start[is_write]     = now;
>> +        bs->slice_end[is_write]       = now + BLOCK_IO_SLICE_TIME;
>> +
>> +        bs->io_disps.bytes[is_write]  = 0;
>> +        bs->io_disps.bytes[!is_write] = 0;
>> +
>> +        bs->io_disps.ios[is_write]    = 0;
>> +        bs->io_disps.ios[!is_write]   = 0;
>
>Does it make sense to keep separate slice_start/slice_end for read and
>write since we reset the dispatched statistics to zero for both?
>Perhaps we should use a scalar slice_start/slice_end and not two
>separate values for read/write.

Right, a scalar slice_start/slice_end should be adopted; see the sketch at
the end of this mail. Thanks.

>
>> +    }
>> +
>> +    /* If a limit was exceeded, immediately queue this request */
>> +    if (qemu_block_queue_has_pending(bs->block_queue)) {
>> +        if (bs->io_limits.bps[BLOCK_IO_LIMIT_TOTAL]
>> +            || bs->io_limits.bps[is_write] || bs->io_limits.iops[is_write]
>> +            || bs->io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
>> +            if (wait) {
>> +                *wait = 0;
>
>This causes the queue to be flushed each time the guest enqueues an
>I/O while there are queued requests.  Perhaps this is (part of) the
>CPU overhead that Ryan's benchmarking discovered.
>
>If we try to preserve request ordering, then I don't think there is a
>reason to modify the timer once it has been set.

Good catch; please check the latest changes on my public git branch. A
rough sketch of the idea is also at the end of this mail.

>
>Stefan
>
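
On the scalar slice question above, a rough standalone sketch of the idea
(the struct and constant names here are illustrative, not the final patch):

#include <stdint.h>
#include <string.h>

#define BLOCK_IO_SLICE_TIME 100000000LL   /* 100 ms in ns (assumed) */

typedef struct {
    int64_t  slice_start;    /* scalar: shared by reads and writes */
    int64_t  slice_end;
    uint64_t bytes_disp[2];  /* [0] = read, [1] = write */
    uint64_t ios_disp[2];
} ThrottleSketch;

/* On each request: extend the current slice if we are still inside it,
 * otherwise start a new slice and drop the dispatched statistics for
 * both directions at once. */
void throttle_update_slice(ThrottleSketch *ts, int64_t now)
{
    if (ts->slice_start < now && now < ts->slice_end) {
        ts->slice_end = now + BLOCK_IO_SLICE_TIME;
    } else {
        ts->slice_start = now;
        ts->slice_end   = now + BLOCK_IO_SLICE_TIME;
        memset(ts->bytes_disp, 0, sizeof(ts->bytes_disp));
        memset(ts->ios_disp,   0, sizeof(ts->ios_disp));
    }
}

Since both directions are reset together anyway, a single slice_start/
slice_end pair loses no information and removes the duplicated state.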
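
And on the timer point, a sketch of the read enqueue path that arms the
timer only when it is not already pending, so a later request can never
push back the deadline of an earlier queued one. This assumes the existing
qemu_timer_pending() helper from qemu-timer.h; the surrounding code is the
patch's read path, abbreviated:

/* throttling disk read I/O */
if (bs->io_limits_enabled
    && bdrv_exceed_io_limits(bs, nb_sectors, false, &wait_time)) {
    ret = qemu_block_queue_enqueue(bs->block_queue, bs, bdrv_aio_readv,
                                   sector_num, qiov, nb_sectors, cb, opaque);
    /* Arm the timer only for the first queued request; later requests
     * simply wait behind it in FIFO order. */
    if (!qemu_timer_pending(bs->block_timer)) {
        qemu_mod_timer(bs->block_timer,
                       wait_time + qemu_get_clock_ns(vm_clock));
    }
    return ret;
}

This should also avoid the repeated queue flushes noted above, since an
incoming request no longer re-arms (and effectively fires) the timer while
requests are already queued.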