* [Qemu-devel] [PATCH] Fix I/O throttling pathologic oscillating behavior
@ 2013-03-20 9:12 Benoît Canet
2013-03-20 9:12 ` [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation Benoît Canet
0 siblings, 1 reply; 14+ messages in thread
From: Benoît Canet @ 2013-03-20 9:12 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, wuzhy, Benoît Canet, stefanha
Limiting a virtio device backed by a QED file to 150 iops with
"-drive file=test.qed,if=virtio,cache=none,iops=150" and running the load.c
program below in the guest leads to an oscillating iops behavior which
amplifies itself as time goes on.
At first an extract of "iostat -x -d 2" output would look like this:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
vdb 0,00 4,11 0,10 149,09 0,39 614,02 8,24
vdb 0,00 0,00 0,00 163,27 0,00 653,06 8,00
vdb 0,00 0,00 0,00 164,65 0,00 658,59 8,00
vdb 0,00 0,00 0,00 84,85 0,00 339,39 8,00
vdb 0,00 0,00 0,00 170,71 0,00 682,83 8,00
vdb 0,00 0,00 0,00 185,71 0,00 742,86 8,00
vdb 0,00 0,00 0,00 174,75 0,00 698,99 8,00
vdb 0,00 0,00 0,00 87,88 0,00 351,52 8,00
The w/s column looks correct.
After a few moments it would look like this:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
vdb 0,00 0,00 0,00 249,00 0,00 996,00 8,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 260,00 0,00 1040,00 8,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 250,00 0,00 1000,00 8,00
vdb 0,00 0,00 0,00 249,49 0,00 997,98 8,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
Here w/s starts to oscillate on a cycle of a few seconds.
Waiting around ten hours leads to this:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 267,68 0,00 1070,71 8,00
vdb 0,00 0,00 0,00 1184,38 0,00 4737,50 8,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 1415,15 0,00 5660,61 8,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00
vdb 0,00 0,00 0,00 939,18 0,00 3756,70 8,00
vdb 0,00 0,00 0,00 523,71 0,00 2094,85 8,00
The oscillation seems to get worse over time, with the amplitude and duration
of the cycle growing.
As the cause of the amplification of the oscillation seems to be the growing
cycle duration, I wrote the "block: fix bdrv_exceed_iops_limits wait
computation" patch, which solves this problem.
load.c - beware of hardcoded "/dev/vdb"
-------------------------------------------------------------------------------
/* Build: gcc -O2 -o load load.c -lpthread */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>     /* pwrite(), sleep(), close() */
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#define BLOCKSIZE 4096
#define THREAD_COUNT 50
#define IOS 100000
pthread_mutex_t mutex;
pthread_cond_t condition;
void *io_thread_body(void *opaque)
{
    int j, fd;
    int i = *(int *) opaque;
    void *buffer;
    if (posix_memalign(&buffer, BLOCKSIZE, BLOCKSIZE)) {
        goto exit;
    }
    fd = open("/dev/vdb", O_RDWR|O_DIRECT);
    if (fd < 0) {
        printf("Failed to open /dev/vdb in O_DIRECT\n");
        goto exit;
    }
    printf("Waiting for signal\n");
    pthread_mutex_lock(&mutex);
    pthread_cond_wait(&condition, &mutex);
    pthread_mutex_unlock(&mutex);
    printf("Starting ios\n");
    /* each thread writes IOS blocks into its own region of the device;
     * the offset math is done in off_t to avoid 32-bit integer overflow */
    for (j = 0; j < IOS; j++) {
        off_t offset = ((off_t) i * IOS + j) * BLOCKSIZE;
        pwrite(fd, buffer, BLOCKSIZE, offset);
    }
    close(fd);
exit:
    free(opaque);
    pthread_exit(NULL);
}
int main(int argc, char **argv)
{
    int i;
    pthread_t threads[THREAD_COUNT];
    pthread_attr_t attr;
    pthread_mutex_init(&mutex, NULL);
    pthread_cond_init(&condition, NULL);
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
    printf("Creating threads\n");
    for (i = 0; i < THREAD_COUNT; i++) {
        int *j = malloc(sizeof(*j));
        *j = i;
        pthread_create(&threads[i], &attr, io_thread_body, j);
    }
    /* crude rendezvous: give every thread time to reach pthread_cond_wait */
    sleep(1);
    printf("Waking up threads\n");
    pthread_cond_broadcast(&condition);
    printf("Waiting for threads to finish\n");
    for (i = 0; i < THREAD_COUNT; i++) {
        pthread_join(threads[i], NULL);
    }
    pthread_attr_destroy(&attr);
    pthread_mutex_destroy(&mutex);
    pthread_cond_destroy(&condition);
    return 0;
}
-------------------------------------------------------------------------------
Benoît Canet (1):
block: fix bdrv_exceed_iops_limits wait computation
block.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--
1.7.10.4
^ permalink raw reply [flat|nested] 14+ messages in thread
* [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation
2013-03-20 9:12 [Qemu-devel] [PATCH] Fix I/O throttling pathologic oscillating behavior Benoît Canet
@ 2013-03-20 9:12 ` Benoît Canet
2013-03-20 10:55 ` Zhi Yong Wu
2013-03-20 13:29 ` Stefan Hajnoczi
0 siblings, 2 replies; 14+ messages in thread
From: Benoît Canet @ 2013-03-20 9:12 UTC (permalink / raw)
To: qemu-devel; +Cc: kwolf, wuzhy, Benoît Canet, stefanha
This patch fixes an I/O throttling behavior triggered by limiting a raw
virtio device to 150 iops and running a load of 50 threads doing random
pwrites on it.
After a few moments the iops count starts to oscillate steadily between 0 and
a value higher than the limit.
As the load keeps running, the period and the amplitude of the oscillation
increase until I/O bursts reaching the physical storage's maximum iops count
occur.
These bursts are followed by guest I/O starvation.
Since the period of this oscillation cycle keeps increasing, the cause must be
a computation error that slowly increases the wait time.
This patch makes the wait time slightly smaller, and tests confirm that it
solves the oscillating behavior.
Signed-off-by: Benoit Canet <benoit@irqsave.net>
---
block.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block.c b/block.c
index 0a062c9..455d8b0 100644
--- a/block.c
+++ b/block.c
@@ -3739,7 +3739,7 @@ static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
}
/* Calc approx time to dispatch */
- wait_time = (ios_base + 1) / iops_limit;
+ wait_time = ios_base / iops_limit;
if (wait_time > elapsed_time) {
wait_time = wait_time - elapsed_time;
} else {
--
1.7.10.4
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation
2013-03-20 9:12 ` [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation Benoît Canet
@ 2013-03-20 10:55 ` Zhi Yong Wu
2013-03-20 13:29 ` Stefan Hajnoczi
1 sibling, 0 replies; 14+ messages in thread
From: Zhi Yong Wu @ 2013-03-20 10:55 UTC (permalink / raw)
To: Benoît Canet; +Cc: kwolf, qemu-devel, stefanha
Reviewed-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
On Wed, 2013-03-20 at 10:12 +0100, Benoît Canet wrote:
> This patch fix an I/O throttling behavior triggered by limiting at 150 iops
> and running a load of 50 threads doing random pwrites on a raw virtio device.
>
> After a few moments the iops count start to oscillate steadily between 0 and a
> value upper than the limit.
>
> As the load keep running the period and the amplitude of the oscillation
> increase until io bursts reaching the physical storage max iops count are
> done.
>
> These bursts are followed by guest io starvation.
>
> As the period of this oscillation cycle is increasing the cause must be a
> computation error leading to increase slowly the wait time.
>
> This patch make the wait time a bit smaller and tests confirm that it solves
> the oscillating behavior.
>
> Signed-off-by: Benoit Canet <benoit@irqsave.net>
> ---
> block.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block.c b/block.c
> index 0a062c9..455d8b0 100644
> --- a/block.c
> +++ b/block.c
> @@ -3739,7 +3739,7 @@ static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
> }
>
> /* Calc approx time to dispatch */
> - wait_time = (ios_base + 1) / iops_limit;
> + wait_time = ios_base / iops_limit;
> if (wait_time > elapsed_time) {
> wait_time = wait_time - elapsed_time;
> } else {
--
Regards,
Zhi Yong Wu
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation
2013-03-20 9:12 ` [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation Benoît Canet
2013-03-20 10:55 ` Zhi Yong Wu
@ 2013-03-20 13:29 ` Stefan Hajnoczi
2013-03-20 14:28 ` Stefan Hajnoczi
1 sibling, 1 reply; 14+ messages in thread
From: Stefan Hajnoczi @ 2013-03-20 13:29 UTC (permalink / raw)
To: Benoît Canet; +Cc: kwolf, wuzhy, qemu-devel
On Wed, Mar 20, 2013 at 10:12:14AM +0100, Benoît Canet wrote:
> This patch fix an I/O throttling behavior triggered by limiting at 150 iops
> and running a load of 50 threads doing random pwrites on a raw virtio device.
>
> After a few moments the iops count start to oscillate steadily between 0 and a
> value upper than the limit.
>
> As the load keep running the period and the amplitude of the oscillation
> increase until io bursts reaching the physical storage max iops count are
> done.
>
> These bursts are followed by guest io starvation.
>
> As the period of this oscillation cycle is increasing the cause must be a
> computation error leading to increase slowly the wait time.
>
> This patch make the wait time a bit smaller and tests confirm that it solves
> the oscillating behavior.
>
> Signed-off-by: Benoit Canet <benoit@irqsave.net>
> ---
> block.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block.c b/block.c
> index 0a062c9..455d8b0 100644
> --- a/block.c
> +++ b/block.c
> @@ -3739,7 +3739,7 @@ static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
> }
>
> /* Calc approx time to dispatch */
> - wait_time = (ios_base + 1) / iops_limit;
> + wait_time = ios_base / iops_limit;
> if (wait_time > elapsed_time) {
> wait_time = wait_time - elapsed_time;
> } else {
I tried reproducing without your test case:
dd if=/dev/vdb of=/dev/null bs=4096 iflag=direct
I've pasted printfs below which reveal that the wait time increases
monotonically! In other words, dd is slowed down more and more as it runs:
bs 0x7f56fe187a30 co 0x7f56fe211cb0 throttled for 1426 ms
bs 0x7f56fe187a30 co 0x7f56fe211cb0 woke up from throttled_reqs after sleeping
bs 0x7f56fe187a30 co 0x7f56fe211cb0 throttled for 1431 ms
bs 0x7f56fe187a30 co 0x7f56fe211cb0 woke up from throttled_reqs after sleeping
bs 0x7f56fe187a30 co 0x7f56fe211cb0 throttled for 1437 ms
bs 0x7f56fe187a30 co 0x7f56fe211cb0 woke up from throttled_reqs after sleeping
...
Killing dd and starting it again resets the accumulated delay (probably because
we end the slice and state is cleared).
This suggests workloads that are constantly at the I/O limit will experience
creeping delay or the oscillations you found.
After applying your patch I observed the opposite behavior: wait time decreases
until it resets itself. Perhaps we're waiting less and less until we just
finish the slice and all values reset:
bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 496 ms
bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 489 ms
bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 484 ms
bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 480 ms
bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 474 ms
...
bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 300 ms
bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 299 ms
bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 298 ms
bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 494 ms
I'm not confident that I understand the effects of your patch. Do you have an
explanation for these results?
More digging will probably be necessary to solve the underlying problem here.
diff --git a/block.c b/block.c
index 0a062c9..7a8c9e6 100644
--- a/block.c
+++ b/block.c
@@ -175,7 +175,9 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs,
int64_t wait_time = -1;
if (!qemu_co_queue_empty(&bs->throttled_reqs)) {
+ fprintf(stderr, "bs %p co %p waiting for throttled_reqs\n", bs, qemu_coroutine_self());
qemu_co_queue_wait(&bs->throttled_reqs);
+ fprintf(stderr, "bs %p co %p woke up from throttled_reqs\n", bs, qemu_coroutine_self());
}
/* In fact, we hope to keep each request's timing, in FIFO mode. The next
@@ -188,7 +190,9 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs,
while (bdrv_exceed_io_limits(bs, nb_sectors, is_write, &wait_time)) {
qemu_mod_timer(bs->block_timer,
wait_time + qemu_get_clock_ns(vm_clock));
+ fprintf(stderr, "bs %p co %p throttled for %"PRId64" ms\n", bs, qemu_coroutine_self(), wait_time);
qemu_co_queue_wait_insert_head(&bs->throttled_reqs);
+ fprintf(stderr, "bs %p co %p woke up from throttled_reqs after sleeping\n", bs, qemu_coroutine_self());
}
qemu_co_queue_next(&bs->throttled_reqs);
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation
2013-03-20 13:29 ` Stefan Hajnoczi
@ 2013-03-20 14:28 ` Stefan Hajnoczi
2013-03-20 14:56 ` Benoît Canet
2013-03-20 15:27 ` Benoît Canet
0 siblings, 2 replies; 14+ messages in thread
From: Stefan Hajnoczi @ 2013-03-20 14:28 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: kwolf, wuzhy, Benoît Canet, qemu-devel
On Wed, Mar 20, 2013 at 02:29:24PM +0100, Stefan Hajnoczi wrote:
> On Wed, Mar 20, 2013 at 10:12:14AM +0100, Benoît Canet wrote:
> > This patch fix an I/O throttling behavior triggered by limiting at 150 iops
> > and running a load of 50 threads doing random pwrites on a raw virtio device.
> >
> > After a few moments the iops count start to oscillate steadily between 0 and a
> > value upper than the limit.
> >
> > As the load keep running the period and the amplitude of the oscillation
> > increase until io bursts reaching the physical storage max iops count are
> > done.
> >
> > These bursts are followed by guest io starvation.
> >
> > As the period of this oscillation cycle is increasing the cause must be a
> > computation error leading to increase slowly the wait time.
> >
> > This patch make the wait time a bit smaller and tests confirm that it solves
> > the oscillating behavior.
> >
> > Signed-off-by: Benoit Canet <benoit@irqsave.net>
> > ---
> > block.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/block.c b/block.c
> > index 0a062c9..455d8b0 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -3739,7 +3739,7 @@ static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
> > }
> >
> > /* Calc approx time to dispatch */
> > - wait_time = (ios_base + 1) / iops_limit;
> > + wait_time = ios_base / iops_limit;
> > if (wait_time > elapsed_time) {
> > wait_time = wait_time - elapsed_time;
> > } else {
>
> I tried reproducing without your test case:
>
> dd if=/dev/vdb of=/dev/null bs=4096 iflag=direct
>
> I've pasted printfs below which reveals that wait time increases monotonically!
> In other words, dd is slowed down more and more as it runs:
>
> bs 0x7f56fe187a30 co 0x7f56fe211cb0 throttled for 1426 ms
> bs 0x7f56fe187a30 co 0x7f56fe211cb0 woke up from throttled_reqs after sleeping
> bs 0x7f56fe187a30 co 0x7f56fe211cb0 throttled for 1431 ms
> bs 0x7f56fe187a30 co 0x7f56fe211cb0 woke up from throttled_reqs after sleeping
> bs 0x7f56fe187a30 co 0x7f56fe211cb0 throttled for 1437 ms
> bs 0x7f56fe187a30 co 0x7f56fe211cb0 woke up from throttled_reqs after sleeping
> ...
>
> Killing dd and starting it again resets the accumulated delay (probably because
> we end the slice and state is cleared).
>
> This suggests workloads that are constantly at the I/O limit will experience
> creeping delay or the oscillations you found.
>
> After applying your patch I observed the opposite behavior: wait time decreases
> until it resets itself. Perhaps we're waiting less and less until we just
> finish the slice and all values reset:
>
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 496 ms
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 489 ms
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 484 ms
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 480 ms
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 474 ms
> ...
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 300 ms
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 299 ms
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 298 ms
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 woke up from throttled_reqs after sleeping
> bs 0x7f2cd2c52a30 co 0x7f2cd2ce3910 throttled for 494 ms
>
> I'm not confident that I understand the effects of your patch. Do you have an
> explanation for these results?
>
> More digging will probably be necessary to solve the underlying problem here.
>
> diff --git a/block.c b/block.c
> index 0a062c9..7a8c9e6 100644
> --- a/block.c
> +++ b/block.c
> @@ -175,7 +175,9 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs,
> int64_t wait_time = -1;
>
> if (!qemu_co_queue_empty(&bs->throttled_reqs)) {
> + fprintf(stderr, "bs %p co %p waiting for throttled_reqs\n", bs, qemu_coroutine_self());
> qemu_co_queue_wait(&bs->throttled_reqs);
> + fprintf(stderr, "bs %p co %p woke up from throttled_reqs\n", bs, qemu_coroutine_self());
> }
>
> /* In fact, we hope to keep each request's timing, in FIFO mode. The next
> @@ -188,7 +190,9 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs,
> while (bdrv_exceed_io_limits(bs, nb_sectors, is_write, &wait_time)) {
> qemu_mod_timer(bs->block_timer,
> wait_time + qemu_get_clock_ns(vm_clock));
> + fprintf(stderr, "bs %p co %p throttled for %"PRId64" ms\n", bs, qemu_coroutine_self(), wait_time);
> qemu_co_queue_wait_insert_head(&bs->throttled_reqs);
> + fprintf(stderr, "bs %p co %p woke up from throttled_reqs after sleeping\n", bs, qemu_coroutine_self());
> }
>
> qemu_co_queue_next(&bs->throttled_reqs);
>
There's something that bothers me:
static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
double elapsed_time, uint64_t *wait)
{
...
We extend the slice by incrementing bs->slice_end. This is done to
account for requests that span slice boundaries. By extending we keep
the io_base[] statistic so that guests cannot cheat by issuing their
requests at the end of the slice.
But I don't understand why bs->slice_time is modified instead of keeping
it constant at 100 ms:
bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
if (wait) {
*wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
}
return true;
}
We'll use bs->slice_time again when a request falls within the current
slice:
static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
bool is_write, int64_t *wait)
{
int64_t now, max_wait;
uint64_t bps_wait = 0, iops_wait = 0;
double elapsed_time;
int bps_ret, iops_ret;
now = qemu_get_clock_ns(vm_clock);
if ((bs->slice_start < now)
&& (bs->slice_end > now)) {
bs->slice_end = now + bs->slice_time;
} else {
I decided to try the following without your patch:
diff --git a/block.c b/block.c
index 0a062c9..2af2da2 100644
--- a/block.c
+++ b/block.c
@@ -3746,8 +3750,8 @@ static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
wait_time = 0;
}
- bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
- bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
+/* bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10; */
+ bs->slice_end += bs->slice_time; /* - 3 * BLOCK_IO_SLICE_TIME; */
if (wait) {
*wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
}
Now there is no oscillation and the wait_times do not grow or shrink
under constant load from dd(1).
Can you try this patch by itself to see if it fixes the oscillation?
If yes, we should audit the code a bit more to figure out the best
solution for extending slice times.
Stefan
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation
2013-03-20 14:28 ` Stefan Hajnoczi
@ 2013-03-20 14:56 ` Benoît Canet
2013-03-20 15:12 ` Stefan Hajnoczi
2013-03-20 15:27 ` Benoît Canet
1 sibling, 1 reply; 14+ messages in thread
From: Benoît Canet @ 2013-03-20 14:56 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: kwolf, wuzhy, qemu-devel, Stefan Hajnoczi
> But I don't understand why bs->slice_time is modified instead of keeping
> it constant at 100 ms:
>
> bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
> bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
> if (wait) {
> *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
> }
In bdrv_exceed_bps_limits there is an equivalent to this with a comment.
---------
/* When the I/O rate at runtime exceeds the limits,
* bs->slice_end need to be extended in order that the current statistic
* info can be kept until the timer fire, so it is increased and tuned
* based on the result of experiment.
*/
bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
if (wait) {
*wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
}
----------
Yes I will try your patch.
Regards
Benoît
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation
2013-03-20 14:56 ` Benoît Canet
@ 2013-03-20 15:12 ` Stefan Hajnoczi
2013-03-21 1:18 ` Zhi Yong Wu
0 siblings, 1 reply; 14+ messages in thread
From: Stefan Hajnoczi @ 2013-03-20 15:12 UTC (permalink / raw)
To: Benoît Canet; +Cc: kwolf, Stefan Hajnoczi, wuzhy, qemu-devel
On Wed, Mar 20, 2013 at 03:56:33PM +0100, Benoît Canet wrote:
> > But I don't understand why bs->slice_time is modified instead of keeping
> > it constant at 100 ms:
> >
> > bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
> > if (wait) {
> > *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > }
>
> In bdrv_exceed_bps_limits there is an equivalent to this with a comment.
>
> ---------
> /* When the I/O rate at runtime exceeds the limits,
> * bs->slice_end need to be extended in order that the current statistic
> * info can be kept until the timer fire, so it is increased and tuned
> * based on the result of experiment.
> */
> bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
> bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
> if (wait) {
> *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
> }
> ----------
The comment explains why slice_end needs to be extended, but not why
bs->slice_time should be changed (except that it was tuned as the result
of an experiment).
Zhi Yong: Do you remember a reason for modifying bs->slice_time?
Stefan
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation
2013-03-20 15:12 ` Stefan Hajnoczi
@ 2013-03-21 1:18 ` Zhi Yong Wu
2013-03-21 9:17 ` Stefan Hajnoczi
0 siblings, 1 reply; 14+ messages in thread
From: Zhi Yong Wu @ 2013-03-21 1:18 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Benoît Canet, Stefan Hajnoczi, qemu-devel, kwolf
On Wed, 2013-03-20 at 16:12 +0100, Stefan Hajnoczi wrote:
> On Wed, Mar 20, 2013 at 03:56:33PM +0100, Benoît Canet wrote:
> > > But I don't understand why bs->slice_time is modified instead of keeping
> > > it constant at 100 ms:
> > >
> > > bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > > bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
> > > if (wait) {
> > > *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > > }
> >
> > In bdrv_exceed_bps_limits there is an equivalent to this with a comment.
> >
> > ---------
> > /* When the I/O rate at runtime exceeds the limits,
> > * bs->slice_end need to be extended in order that the current statistic
> > * info can be kept until the timer fire, so it is increased and tuned
> > * based on the result of experiment.
> > */
> > bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
> > if (wait) {
> > *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > }
> > ----------
>
> The comment explains why slice_end needs to be extended, but not why
> bs->slice_time should be changed (except that it was tuned as the result
> of an experiment).
>
> Zhi Yong: Do you remember a reason for modifying bs->slice_time?
Stefan,
In some cases, when the bare I/O speed of the physical machine is very fast
and the I/O speed is limited to a lower value, I/O needs to wait for a
relatively long time (i.e. wait_time). wait_time should be smaller than
slice_time; if slice_time is constant, wait_time may not reach its expected
value, so the throttling function will not work well.
For example, say the bare I/O speed is 100MB/s, the I/O throttling speed is
1MB/s, and slice_time is constant and set to 50ms (an assumed value) or
smaller. If the current I/O is to be throttled to 1MB/s, its wait_time is
expected to be 100ms (an assumed value), which is bigger than the current
slice_time, so the I/O throttling function will not throttle the actual I/O
speed well. In this case, slice_time needs to be adjusted to a more suitable
value that depends on wait_time.
In some other cases, when the bare I/O speed is very slow and the I/O
throttling speed is fast, slice_time also needs to be adjusted dynamically
based on wait_time.
If I remember correctly, that is the reason.
>
> Stefan
>
--
Regards,
Zhi Yong Wu
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation
2013-03-21 1:18 ` Zhi Yong Wu
@ 2013-03-21 9:17 ` Stefan Hajnoczi
2013-03-21 13:04 ` Zhi Yong Wu
0 siblings, 1 reply; 14+ messages in thread
From: Stefan Hajnoczi @ 2013-03-21 9:17 UTC (permalink / raw)
To: Zhi Yong Wu; +Cc: Benoît Canet, kwolf, qemu-devel, Stefan Hajnoczi
On Thu, Mar 21, 2013 at 09:18:27AM +0800, Zhi Yong Wu wrote:
> On Wed, 2013-03-20 at 16:12 +0100, Stefan Hajnoczi wrote:
> > On Wed, Mar 20, 2013 at 03:56:33PM +0100, Benoît Canet wrote:
> > > > But I don't understand why bs->slice_time is modified instead of keeping
> > > > it constant at 100 ms:
> > > >
> > > > bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > > > bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
> > > > if (wait) {
> > > > *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > > > }
> > >
> > > In bdrv_exceed_bps_limits there is an equivalent to this with a comment.
> > >
> > > ---------
> > > /* When the I/O rate at runtime exceeds the limits,
> > > * bs->slice_end need to be extended in order that the current statistic
> > > * info can be kept until the timer fire, so it is increased and tuned
> > > * based on the result of experiment.
> > > */
> > > bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > > bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
> > > if (wait) {
> > > *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > > }
> > > ----------
> >
> > The comment explains why slice_end needs to be extended, but not why
> > bs->slice_time should be changed (except that it was tuned as the result
> > of an experiment).
> >
> > Zhi Yong: Do you remember a reason for modifying bs->slice_time?
> Stefan,
> In some case that the bare I/O speed is very fast on physical machine,
> when I/O speed is limited to be one lower value, I/O need to wait for
> one relative longer time(i.e. wait_time). You know, wait_time should be
> smaller than slice_time, if slice_time is constant, wait_time may not be
> its expected value, so the throttling function will not work well.
> For example, bare I/O speed is 100MB/s, I/O throttling speed is 1MB/s,
> slice_time is constant, and set to 50ms(a assumed value) or smaller, If
> current I/O can be throttled to 1MB/s, its wait_time is expected to
> 100ms(a assumed value), and is more bigger than current slice_time, I/O
> throttling function will not throttle actual I/O speed well. In the
> case, slice_time need to be adjusted to one more suitable value which
> depends on wait_time.
When an I/O request spans a slice:
1. It must wait until enough resources are available.
2. We extend the slice so that existing accounting is not lost.
But I don't understand what you say about a fast host. The bare metal
throughput does not affect the throttling calculation. The only values
that matter are bps limit and slice time:
In your example the slice time is 50ms and the current request needs
100ms. We need to extend slice_end to at least 100ms so that we can
account for this request.
Why should slice_time be changed?
> In some other case that the bare I/O speed is very slow and I/O
> throttling speed is fast, slice_time also need to be adjusted
> dynamically based on wait_time.
If the host is slower than the I/O limit there are two cases:
1. Requests are below I/O limit. We do not throttle, the host is slow
but that's okay.
2. Requests are above I/O limit. We throttle them but actually the host
will slow them down further to the bare metal speed. This is also fine.
Again, I don't see a need to change slice_time.
BTW I discovered one thing that Linux blk-throttle does differently from
QEMU I/O throttling: we do not trim completed slices. I think trimming
avoids accumulating values which may lead to overflows if the slice
keeps getting extended due to continuous I/O.
blk-throttle does not modify throtl_slice (their equivalent of
slice_time).
Stefan
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation
2013-03-21 9:17 ` Stefan Hajnoczi
@ 2013-03-21 13:04 ` Zhi Yong Wu
2013-03-21 15:14 ` Stefan Hajnoczi
0 siblings, 1 reply; 14+ messages in thread
From: Zhi Yong Wu @ 2013-03-21 13:04 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Benoît Canet, kwolf, qemu-devel, Stefan Hajnoczi
On Thu, 2013-03-21 at 10:17 +0100, Stefan Hajnoczi wrote:
> On Thu, Mar 21, 2013 at 09:18:27AM +0800, Zhi Yong Wu wrote:
> > On Wed, 2013-03-20 at 16:12 +0100, Stefan Hajnoczi wrote:
> > > On Wed, Mar 20, 2013 at 03:56:33PM +0100, Benoît Canet wrote:
> > > > > But I don't understand why bs->slice_time is modified instead of keeping
> > > > > it constant at 100 ms:
> > > > >
> > > > > bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > > > > bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
> > > > > if (wait) {
> > > > > *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > > > > }
> > > >
> > > > In bdrv_exceed_bps_limits there is an equivalent to this with a comment.
> > > >
> > > > ---------
> > > > /* When the I/O rate at runtime exceeds the limits,
> > > > * bs->slice_end need to be extended in order that the current statistic
> > > > * info can be kept until the timer fire, so it is increased and tuned
> > > > * based on the result of experiment.
> > > > */
> > > > bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > > > bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
> > > > if (wait) {
> > > > *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > > > }
> > > > ----------
> > >
> > > The comment explains why slice_end needs to be extended, but not why
> > > bs->slice_time should be changed (except that it was tuned as the result
> > > of an experiment).
> > >
> > > Zhi Yong: Do you remember a reason for modifying bs->slice_time?
> > Stefan,
> > In some cases where the bare I/O speed of the physical machine is very
> > fast and I/O is limited to a much lower value, a request needs to wait
> > for a relatively long time (i.e. wait_time). wait_time should be
> > smaller than slice_time; if slice_time is constant, wait_time may not
> > reach its expected value, so throttling will not work well.
> > For example, if the bare I/O speed is 100MB/s, the throttling limit is
> > 1MB/s, and slice_time is constant at 50ms (an assumed value) or
> > smaller, then throttling the current I/O to 1MB/s requires a wait_time
> > of around 100ms (an assumed value), which is bigger than the current
> > slice_time, and the actual I/O speed will not be throttled well. In
> > that case, slice_time needs to be adjusted to a more suitable value
> > that depends on wait_time.
>
> When an I/O request spans a slice:
> 1. It must wait until enough resources are available.
> 2. We extend the slice so that existing accounting is not lost.
>
> But I don't understand what you say about a fast host. The bare metal
I mean a host with very high bare-metal throughput.
> throughput does not affect the throttling calculation. The only values
> that matter are bps limit and slice time:
>
> In your example the slice time is 50ms and the current request needs
> 100ms. We need to extend slice_end to at least 100ms so that we can
> account for this request.
>
> Why should slice_time be changed?
It isn't a mandatory choice; if you have a better way, we can do it your
way. My thinking was that if wait_time is big in the previous slice
window, slice_time should also be made a bit bigger for the next slice
window.
>
> > In another case, where the bare I/O speed is very slow and the I/O
> > throttling limit is high, slice_time also needs to be adjusted
> > dynamically based on wait_time.
>
> If the host is slower than the I/O limit there are two cases:
This is not what I mean; the bare I/O speed is faster than the I/O
limit, but the gap between them is very small.
>
> 1. Requests are below I/O limit. We do not throttle, the host is slow
> but that's okay.
>
> 2. Requests are above I/O limit. We throttle them but actually the host
> will slow them down further to the bare metal speed. This is also fine.
>
> Again, I don't see a need to change slice_time.
>
> BTW I discovered one thing that Linux blk-throttle does differently from
> QEMU I/O throttling: we do not trim completed slices. I think trimming
> avoids accumulating values which may lead to overflows if the slice
> keeps getting extended due to continuous I/O.
QEMU I/O throttling is not completely the same as the Linux block
throttling approach.
>
> blk-throttle does not modify throtl_slice (their equivalent of
> slice_time).
>
> Stefan
>
--
Regards,
Zhi Yong Wu
* Re: [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation
2013-03-21 13:04 ` Zhi Yong Wu
@ 2013-03-21 15:14 ` Stefan Hajnoczi
0 siblings, 0 replies; 14+ messages in thread
From: Stefan Hajnoczi @ 2013-03-21 15:14 UTC (permalink / raw)
To: Zhi Yong Wu; +Cc: Benoît Canet, kwolf, qemu-devel, Stefan Hajnoczi
On Thu, Mar 21, 2013 at 09:04:20PM +0800, Zhi Yong Wu wrote:
> On Thu, 2013-03-21 at 10:17 +0100, Stefan Hajnoczi wrote:
> > On Thu, Mar 21, 2013 at 09:18:27AM +0800, Zhi Yong Wu wrote:
> > > On Wed, 2013-03-20 at 16:12 +0100, Stefan Hajnoczi wrote:
> > > > On Wed, Mar 20, 2013 at 03:56:33PM +0100, Benoît Canet wrote:
> > > > > > But I don't understand why bs->slice_time is modified instead of keeping
> > > > > > it constant at 100 ms:
> > > > > >
> > > > > > bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > > > > > bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
> > > > > > if (wait) {
> > > > > > *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > > > > > }
> > > > >
> > > > > In bdrv_exceed_bps_limits there is an equivalent to this with a comment.
> > > > >
> > > > > ---------
> > > > > /* When the I/O rate at runtime exceeds the limits,
> > > > > * bs->slice_end need to be extended in order that the current statistic
> > > > > * info can be kept until the timer fire, so it is increased and tuned
> > > > > * based on the result of experiment.
> > > > > */
> > > > > bs->slice_time = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > > > > bs->slice_end += bs->slice_time - 3 * BLOCK_IO_SLICE_TIME;
> > > > > if (wait) {
> > > > > *wait = wait_time * BLOCK_IO_SLICE_TIME * 10;
> > > > > }
> > > > > ----------
> > > >
> > > > The comment explains why slice_end needs to be extended, but not why
> > > > bs->slice_time should be changed (except that it was tuned as the result
> > > > of an experiment).
> > > >
> > > > Zhi Yong: Do you remember a reason for modifying bs->slice_time?
> > > Stefan,
> > > In some cases where the bare I/O speed of the physical machine is very
> > > fast and I/O is limited to a much lower value, a request needs to wait
> > > for a relatively long time (i.e. wait_time). wait_time should be
> > > smaller than slice_time; if slice_time is constant, wait_time may not
> > > reach its expected value, so throttling will not work well.
> > > For example, if the bare I/O speed is 100MB/s, the throttling limit is
> > > 1MB/s, and slice_time is constant at 50ms (an assumed value) or
> > > smaller, then throttling the current I/O to 1MB/s requires a wait_time
> > > of around 100ms (an assumed value), which is bigger than the current
> > > slice_time, and the actual I/O speed will not be throttled well. In
> > > that case, slice_time needs to be adjusted to a more suitable value
> > > that depends on wait_time.
> >
> > When an I/O request spans a slice:
> > 1. It must wait until enough resources are available.
> > 2. We extend the slice so that existing accounting is not lost.
> >
> > But I don't understand what you say about a fast host. The bare metal
> I mean a host with very high bare-metal throughput.
> > throughput does not affect the throttling calculation. The only values
> > that matter are bps limit and slice time:
> >
> > In your example the slice time is 50ms and the current request needs
> > 100ms. We need to extend slice_end to at least 100ms so that we can
> > account for this request.
> >
> > Why should slice_time be changed?
> It isn't a mandatory choice; if you have a better way, we can do it your
> way. My thinking was that if wait_time is big in the previous slice
> window, slice_time should also be made a bit bigger for the next slice
> window.
> >
> > > In another case, where the bare I/O speed is very slow and the I/O
> > > throttling limit is high, slice_time also needs to be adjusted
> > > dynamically based on wait_time.
> >
> > If the host is slower than the I/O limit there are two cases:
> This is not what I mean; the bare I/O speed is faster than the I/O
> limit, but the gap between them is very small.
>
> >
> > 1. Requests are below I/O limit. We do not throttle, the host is slow
> > but that's okay.
> >
> > 2. Requests are above I/O limit. We throttle them but actually the host
> > will slow them down further to the bare metal speed. This is also fine.
> >
> > Again, I don't see a need to change slice_time.
> >
> > BTW I discovered one thing that Linux blk-throttle does differently from
> > QEMU I/O throttling: we do not trim completed slices. I think trimming
> > avoids accumulating values which may lead to overflows if the slice
> > keeps getting extended due to continuous I/O.
> QEMU I/O throttling is not completely the same as the Linux block
> throttling approach.
There is a reason why blk-throttle implements trimming and it could be
important for us too. So I calculated how long it would take to
overflow int64_t with 2 GByte/s of continuous I/O. The result is 136
years so it does not seem to be necessary in practice yet :).
Stefan
* Re: [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation
2013-03-20 14:28 ` Stefan Hajnoczi
2013-03-20 14:56 ` Benoît Canet
@ 2013-03-20 15:27 ` Benoît Canet
2013-03-21 10:34 ` Stefan Hajnoczi
1 sibling, 1 reply; 14+ messages in thread
From: Benoît Canet @ 2013-03-20 15:27 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: kwolf, wuzhy, qemu-devel, Stefan Hajnoczi
> Now there is no oscillation and the wait_times do not grow or shrink
> under constant load from dd(1).
>
> Can you try this patch by itself to see if it fixes the oscillation?
On my test setup it fixes the oscillation and leads to an average of 149.88 iops.
However another pattern appears.
iostat -d 1 -x will show something between 150 and 160 iops for several
samples, then a sample will show around 70 iops to compensate for the
additional I/Os, and this cycle restarts.
Best regards
Benoît
* Re: [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation
2013-03-20 15:27 ` Benoît Canet
@ 2013-03-21 10:34 ` Stefan Hajnoczi
2013-03-21 14:28 ` Benoît Canet
0 siblings, 1 reply; 14+ messages in thread
From: Stefan Hajnoczi @ 2013-03-21 10:34 UTC (permalink / raw)
To: Benoît Canet; +Cc: kwolf, wuzhy, qemu-devel, Stefan Hajnoczi
On Wed, Mar 20, 2013 at 04:27:14PM +0100, Benoît Canet wrote:
> > Now there is no oscillation and the wait_times do not grow or shrink
> > under constant load from dd(1).
> >
> > Can you try this patch by itself to see if it fixes the oscillation?
>
> On my test setup it fixes the oscillation and leads to an average of 149.88 iops.
> However another pattern appears.
> iostat -d 1 -x will show something between 150 and 160 iops for several
> samples, then a sample will show around 70 iops to compensate for the
> additional I/Os, and this cycle restarts.
I've begun drilling down on these fluctuations.
I think the problem is that I/O throttling uses bdrv_acct_done()
accounting. bdrv_acct_done is only called when requests complete. This
has the following problem:
Number of IOPS in this slice @ 150 IOPS = 15 ops per 100 ms slice
14 ops have completed already, only 1 more can proceed.
3 ops arrive in rapid succession:
Op #1: Allowed through since 1 op can proceed. We submit the op.
Op #2: Allowed through since op #1 is still in progress so
bdrv_acct_done() has not been called yet.
Op #3: Allowed through since op #1 & #2 are still in progress so
bdrv_acct_done() has not been called yet.
Now when the ops start completing and the slice is extended we end up
with weird wait times since we overspent our budget.
I'm going to try a fix for delayed accounting. Will report back with
patches if it is successful.
Stefan
* Re: [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation
2013-03-21 10:34 ` Stefan Hajnoczi
@ 2013-03-21 14:28 ` Benoît Canet
0 siblings, 0 replies; 14+ messages in thread
From: Benoît Canet @ 2013-03-21 14:28 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Benoît Canet, kwolf, wuzhy, qemu-devel, Stefan Hajnoczi
The +1 was there to account for the current request as already done in
this slice. Statistically there is a 50% chance that this is wrong.
I toyed with using +0.5 instead: wait_time doesn't drift anymore and the
iops don't oscillate.
diff --git a/block.c b/block.c
index 0a062c9..455d8b0 100644
--- a/block.c
+++ b/block.c
@@ -3739,7 +3739,7 @@ static bool bdrv_exceed_iops_limits(BlockDriverState *bs,
bool is_write,
}
/* Calc approx time to dispatch */
- wait_time = (ios_base + 1) / iops_limit;
+ wait_time = (ios_base + 0.5) / iops_limit;
if (wait_time > elapsed_time) {
wait_time = wait_time - elapsed_time;
} else {
I will let a VM run with this patch for a while.
Regards
Benoît
> On Thursday 21 Mar 2013 at 11:34:53 (+0100), Stefan Hajnoczi wrote:
> On Wed, Mar 20, 2013 at 04:27:14PM +0100, Benoît Canet wrote:
> > > Now there is no oscillation and the wait_times do not grow or shrink
> > > under constant load from dd(1).
> > >
> > > Can you try this patch by itself to see if it fixes the oscillation?
> >
> > On my test setup it fixes the oscillation and lead to an average 149.88 iops.
> > However another pattern appear.
> > iostat -d 1 -x will show something between 150 and 160 iops for several sample
> > then a sample would show around 70 iops to compensate for the additional ios
> > and this cycle restart.
>
> I've begun drilling down on these fluctuations.
>
> I think the problem is that I/O throttling uses bdrv_acct_done()
> accounting. bdrv_acct_done is only called when requests complete. This
> has the following problem:
>
> Number of IOPS in this slice @ 150 IOPS = 15 ops per 100 ms slice
>
> 14 ops have completed already, only 1 more can proceed.
>
> 3 ops arrive in rapid succession:
>
> Op #1: Allowed through since 1 op can proceed. We submit the op.
> Op #2: Allowed through since op #1 is still in progress so
> bdrv_acct_done() has not been called yet.
> Op #3: Allowed through since op #1 & #2 are still in progress so
> bdrv_acct_done() has not been called yet.
>
> Now when the ops start completing and the slice is extended we end up
> with weird wait times since we overspent our budget.
>
> I'm going to try a fix for delayed accounting. Will report back with
> patches if it is successful.
>
> Stefan
>
end of thread, other threads:[~2013-03-21 15:15 UTC | newest]
Thread overview: 14+ messages
2013-03-20 9:12 [Qemu-devel] [PATCH] Fix I/O throttling pathologic oscillating behavior Benoît Canet
2013-03-20 9:12 ` [Qemu-devel] [PATCH] block: fix bdrv_exceed_iops_limits wait computation Benoît Canet
2013-03-20 10:55 ` Zhi Yong Wu
2013-03-20 13:29 ` Stefan Hajnoczi
2013-03-20 14:28 ` Stefan Hajnoczi
2013-03-20 14:56 ` Benoît Canet
2013-03-20 15:12 ` Stefan Hajnoczi
2013-03-21 1:18 ` Zhi Yong Wu
2013-03-21 9:17 ` Stefan Hajnoczi
2013-03-21 13:04 ` Zhi Yong Wu
2013-03-21 15:14 ` Stefan Hajnoczi
2013-03-20 15:27 ` Benoît Canet
2013-03-21 10:34 ` Stefan Hajnoczi
2013-03-21 14:28 ` Benoît Canet